ENS Name Normalization 2nd

Ronald · October 17, 2022, 5:12am

So to clarify, would -0-.eth be invalid, but -00-.eth would be?

I forgot about the leading underscores, is that still just leading?

After thinking about this for some weeks, I really would think everything through with the mappings/groupings. I can think of weird things, like in Dutch ij is one letter, and their keyboard has it as one letter, but everyone Dutch doesn’t want a Dutch keyboard because everyone speaks English. I remember learning Dutch for a year after I lost my phone in Amsterdam in a taxi. I replaced it with the same one, except the Windows CE version it was in was locked to Dutch language. It was a lot of money, so I just learned the Dutch UI instead of buying another phone. My Dutch friends laughed a lot. Microsoft didn’t get that localizing for Dutch meant putting things in English.

The context of that long story matters because even in that localization there wasn’t ij as a letter in the primary keyboard, it was an alternate still, so I always just typed the two letters. This kind of scenario where you can spell things both ways can be infuriating when mapping languages to specific keysets, or making word lists.

I write in Spanish often (poorly), but I never use accents or the ñ because I have an English keyboard. In fact, the English speaking city where I grew up had a street name in Spanish with an ñ, and 30 years on, now it’s replaced with just an n, and everyone says the street name wrong. Kind of like how everyone says Crimea wrong today in English (Hint: it never rhymed with crime before the 2000s). A simple gap in history, and new history was made. I hope the same thing doesn’t happen to more obscure languages just because someone not familiar with the language makes a booboo. I’m not going to take any political stance here, but I do think perception has shifted based on the pronouncing of Crimea, and writing things as Karabakh instead of Artsakh. Almost nobody means harm doing it, but yet ignorance got in there and massaged history for the eyes of outsiders.

Æ and Œ are both valid letters in English. Just because English changed to omit these letters most of the time does not make them less valid. It’s similar to church Russian or old Russian. Ij in Dutch has kind of gone the way of Æ and Œ, but in our lifetime. It’s why it was already relegated on the keyboard circa 2006. Should Œ map to oe? Are they the same?

I know how to pronounce them, but I have no idea how to even argue for what to do if there were a mapping of just English letters. Who decides the English alphabet? Is it all of the English words that ever existed? Is it just what is currently taught in schools? What about dialects of English that have weird digraphs?

raffy · October 17, 2022, 5:45am

Both are valid but 00--.eth is invalid because it has -- at 2,3.

Yes. Allowed(__ab.eth) and Disallowed(a_b.eth)

There are a lot of digraphs/ligatures in Latin (50+ lowercase). In general, I think they’re all bad for ENS. I’ve kept a few literal ones ĳ/ﬀ/ﬂ/… (which are all mapped to their component letters by IDNA) and æ (which stays as one) but I disallowed the rest. ﬅ/ﬆ/ꝡ/ꜳ/… seem bad (mapped or valid.)

I’m about to deploy an update with script-based combining-mark whitelists. This enables stuff like “with dot above” on specific Latin characters w/o enabling it on characters like “i” (which is pure insanity.) This also allows whitelisting multiple-CMs at the character-level (since the general rule allows at most one).

For your example, tilde is only allowed over a, e, n, and o. Tilde variations like middle, double-middle, vertical, below, overlay, etc. are all disabled.
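A minimal sketch of how the tilde rule above could be checked at the character level. The names `TILDE_BASES` and `tildePlacementOk` are made up for illustration and are not from ens_normalize; the sketch assumes the label is already lowercased and only looks at U+0303, leaving every other mark to other rules.

```javascript
const TILDE = 0x303; // COMBINING TILDE
// Base letters the post above says may carry a tilde.
const TILDE_BASES = new Set(['a', 'e', 'n', 'o']);

// Decompose the label (NFD) so that precomposed letters like "ñ" become
// base + combining mark, then require a whitelisted base before each tilde.
function tildePlacementOk(label) {
  const chars = Array.from(label.normalize('NFD'));
  return chars.every((ch, i) =>
    ch.codePointAt(0) !== TILDE || (i > 0 && TILDE_BASES.has(chars[i - 1])));
}
```

With this table, a label containing ñ or ã passes, while ũ fails because "u" is not a whitelisted base. A second tilde stacked on the same letter also fails, since its preceding character is a mark rather than a base.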

For Greek/Cyrillic/Common, I’ve used the exemplars, for Latin, I’ve picked a subset of exemplars at my own discretion. For example, I’m pretty sure we don’t need all these variations: ā a᷆ a᷇ but these are great: é ñ ç.

From the Isolated characters I’ve collected, they all turned out to be Common-script, so I’ve now retired the Isolated concept (characters that should never get CMs) as I get that for free with this approach.

raffy · October 17, 2022, 7:24am

With Latin cleaned up, many of the silly Greek and Cyrillic whole-script confusables disappeared. I also made an effort to remove all of the Greek characters from Latin (like ɑ α, ɩ ι, ẟ δ, etc.). Similar to the ξ and π change (relax script to Common), I think it would be reasonable to do the same to any of α(a), β(ß), γ(y), σ(o), ω(w), τ(t) (I put the nearest Latin confusable in parentheses.)

For example, the Greek whole-script confusables have simplified to a very reasonable set.

Greek Confusables

[
// 61 (a) LATIN SMALL LETTER A (Latn)
0x3B1, // 3B1 (α) GREEK SMALL LETTER ALPHA (maybe)
// 69 (i) LATIN SMALL LETTER I (Latn)
0x3B9, // 3B9 (ι) GREEK SMALL LETTER IOTA
// 6A (j) LATIN SMALL LETTER J (Latn)
0x3F3, // 3F3 (ϳ) GREEK LETTER YOT
// 6F (o) LATIN SMALL LETTER O (Latn)
0x3BF, // 3BF (ο) GREEK SMALL LETTER OMICRON
0x3C3, // 3C3 (σ) GREEK SMALL LETTER SIGMA (maybe)
// 70 (p) LATIN SMALL LETTER P (Latn)
0x3C1, // 3C1 (ρ) GREEK SMALL LETTER RHO
// 72 (r) LATIN SMALL LETTER R (Latn)
0x1D26, // 1D26 (ᴦ) GREEK LETTER SMALL CAPITAL GAMMA
// DF (ß) LATIN SMALL LETTER SHARP S (Latn)
0x3B2, // 3B2 (β) GREEK SMALL LETTER BETA
// 75 (u) LATIN SMALL LETTER U (Latn)
0x3C5, // 3C5 (υ) GREEK SMALL LETTER UPSILON
// 76 (v) LATIN SMALL LETTER V (Latn)
0x3BD, // 3BD (ν) GREEK SMALL LETTER NU
// 79 (y) LATIN SMALL LETTER Y (Latn)
0x3B3, // 3B3 (γ) GREEK SMALL LETTER GAMMA
]

The main issue with confusables is that I don’t know the ordering of all scripts. At the moment, I have 3 ordered scripts (Latin, Greek, Cyrillic), 133 scripts tagged as Restricted (which don’t interact with anything else), and 25 scripts that still need to be decided.

Why an ordering?

Latin "o" confuses with Greek "ο" — which one gets ooo.eth?
Greek "φ" confuses with Cyrillic "ф" — which one gets φφφ.eth?

Latin confuses with nothing, Greek confuses with Latin, and Cyrillic confuses with Latin and Greek, and the Restricted scripts confuse with everything but themselves. At the moment, the unordered scripts don’t confuse with anything.
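To make the ordering concrete, here is a rough sketch of a whole-script check under the Latin > Greek > Cyrillic order. The table is a tiny hand-picked subset (four Greek entries from the list above plus the φ/ф example) and `losesToScript` is a hypothetical name; a real table would be far larger and this is not how ens_normalize stores it.

```javascript
// Code point -> higher-priority script that owns its look-alike.
const CONFUSABLES = new Map([
  [0x3B9, 'Latin'], // Greek iota looks like Latin i
  [0x3BF, 'Latin'], // Greek omicron looks like Latin o
  [0x3C1, 'Latin'], // Greek rho looks like Latin p
  [0x3BD, 'Latin'], // Greek nu looks like Latin v
  [0x444, 'Greek'], // Cyrillic ef looks like Greek phi
]);

// Returns the script that wins the name if EVERY character of the label has
// a look-alike in the same higher-priority script, otherwise null.
function losesToScript(label) {
  const targets = Array.from(label, ch => CONFUSABLES.get(ch.codePointAt(0)));
  const first = targets[0];
  return first !== undefined && targets.every(t => t === first) ? first : null;
}
```

Under these assumptions, a label of three Greek omicrons loses to Latin, three Cyrillic ф lose to Greek, and three Greek φ lose to nobody, so Greek keeps φφφ.eth.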

Scripts that Need to Be Ordered or Restricted

1 Armn Armenian
2 Arab Arabic
3 Thaa Thaana
4 Deva Devanagari
5 Beng Bengali
6 Guru Gurmukhi
7 Gujr Gujarati
8 Orya Oriya
9 Taml Tamil
10 Telu Telugu
11 Knda Kannada
12 Mlym Malayalam
13 Sinh Sinhala
14 Thai Thai
15 Laoo Lao
16 Tibt Tibetan
17 Mymr Myanmar
18 Geor Georgian
19 Hang Hangul
20 Ethi Ethiopic
21 Khmr Khmer
22 Hira Hiragana
23 Kana Katakana
24 Bopo Bopomofo
25 Hani Han

As I commented above, I made Hebrew restricted. I looked at all the registered names with Hebrew and they’re almost entirely single-script names. If we keep this classification, it would be possible to give Hebrew its own hyphen (Maqaf ־) and punctuation (Geresh ׳ and Gershayim ״). However, as restricted, it can’t mix with ASCII or use its own currency symbol (₪) although that could be fixed.

A question was asked about ß vs ss. In UTS-46, ß is a deviation character. Depending on which IDNA version is used, it either stays ß or gets mapped to ss. What’s really weird is that capital ẞ isn’t a deviation and is hard-mapped to ss. Since ENSIP-1 allowed ZWJ (a deviation), I’ve assumed deviations are valid. Note: browsers seem to be a toss-up on this: Safari and Firefox keep it valid, Brave and many online punycoders map it. For registrations, people use it as a "b" like ßinance and ßoobs but it’s also used properly in weißbier and philippstöß.

For Common script, the main thing remaining is organizing the shapes, stars, crosses, and arrow symbols (there’s zillions of them spread all over the place.) These could probably remain as-is but they’re effectively useless until they’re reduced to distinct well-supported choices (like emoji.)

I know some of the emoji enthusiasts have developed various lists but it would be great to have a breakdown of emoji that render pixel-identical ( vs ), -nearly identical ( vs ), or mask-identical ( vs ) on various platforms. Given an emoji, it would be nice to know how closely you should inspect it.

Ronald · October 17, 2022, 2:07pm

Can’t approve more of the thought you are putting into this. I wish more people would be as excited at solving this immensely complex problem that will benefit everyone. I want to point to this thread as proof of why ENS > DNS. It’s not just decentralizing or semi-decentralizing domain names, it’s about solving the problems DNS couldn’t and bringing the world together to avoid conflict, while at the same time preserving culture/language/intellectual diversity.

I know some of the emoji enthusiasts have developed various lists but it would be great to have a breakdown of emoji that render pixel-identical ( vs ), -nearly identical ( vs ), or mask-identical ( vs ) on various platforms. Given an emoji, it would be nice to know how closely you should inspect it.

I have an idea for this for the scanner. Each emoji should have alt text. You give each emoji a numerical UID and you could also give it an English name for lookup on docs. So when a user hovers over it in a scanner they can see the UID. Of course the token ID works for devs, but a succinct number is best (which is why digits are so popular now in ENS).

As I commented above, I made Hebrew restricted. I looked at all the registered names with Hebrew and they’re almost entirely single-script names. If we keep this classification, it would be possible to give Hebrew its own hyphen (Maqaf ־ ) and punctuation (Geresh ׳ and Gershayim ״ ). However, as restricted, it can’t mix with ASCII or use its own currency symbol (₪) although that could be fixed.

Hebrew and Arabic are languages deeply tied into the culture and religion (although Arabic predates Islam and it’s also important in other religions). They both have diacritical marks important to religious texts. Hebrew has pointed letters that are usually omitted, but it’s like omitting ë from Russian names. You can’t map ë to e. This is why everyone says Khruschev wrong. His name has a ë and it’s actually pronounced Hruschove which sounds like shove, not shev. This applies to many names that are butchered needlessly in English from poor transliteration, that stems from a poor understanding of English dialects and pronunciations.

My grandfather’s name was Saul. In Yiddish and Hebrew it is שאול with diacritics and שָׁאוּל. The first letter is shin, which can be s or sh depending on where the normally omitted point on the letter is. His mother (who only spoke Yiddish) called him Saul like in Better Call Saul even though it’s actually pronounced Shaul in both Hebrew and Yiddish. Why? Because in the Russian Empire (now claimed by the state of Ukraine) the Yiddish speaking areas had to transliterate names into Russian (old Russian). Births in shtetls were registered in both Yiddish and Russian, and like Khruschev, Saul got mistransliterated to Саул, so people just gave up trying to say it right, much like politicization of the pronunciation of various terms in Russian/Ukrainian.

I had the same problem with my daughter. She was born in a Latino country, but is entitled to Ukrainian, Russian, and Israeli passports. She has a name with a sound that doesn’t exist in Russian. She was also named something English in a Spanish speaking country, and there were multiple transliterations into Cyrillic going on. Name normalization in the Cyrillic (another word pronounced wrong now in English - used to be a hard K, unlike Celtics which Americans amusingly say historically correctly) passports means that not only did she need a waiver to translate from English transliteration, it meant that the transliteration back to ASCII from Cyrillic (all passports need ASCII) needed a waiver as well so it would match her birth certificate’s English name.

This was not a small deal, it took almost an hour to solve and wrap the head around, because the computer systems were not made to solve this problem automatically. It took humans to say that the computer was wrong, and to override. I provide details of these things as a warning to the real world use of normalization and mapping. It can be ugly in some spots. I won’t get into the Americanization of Dutch surnames this time. This crime happened recently because of primitive databases.

One more question on the hyphens. Is the trailing hyphen out now too? So _qwerty_.eth fails?

Theth.eth · October 18, 2022, 12:57am

Your wording has changed on this

It used to be position 3 & 4

1234

Now it is 2 & 3

0123

Are you going to use 0123 from now on?

@Ronald

I’m guessing you are meaning trailing underscores?? They are easy to confuse

Underscores are only going through if they are leading

You can have multiple leading underscores

You can not have underscores in the middle of a name

You can not have underscores trailing
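The underscore rules listed above fit in a few lines. This is only a sketch of the rule as summarized in this thread; `underscoresOk` is a hypothetical helper name, not part of ens_normalize.

```javascript
// Underscores are accepted only as a leading run: strip that run,
// then require that no underscore is left in the middle or at the end.
function underscoresOk(label) {
  const rest = label.replace(/^_+/, '');
  return !rest.includes('_');
}
```

So `__ab` passes, while `a_b` and `_qwerty_` both fail.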

raffy · October 20, 2022, 10:13am

Sorry, it’s a programming thing (0-vs-1 based indexing). It’s always been 3rd and 4th characters (think punycode, xn--a...).
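The two ways of counting describe the same check. A small sketch (the function name is made up for illustration):

```javascript
// Flags labels with "--" as the 3rd and 4th characters (1-based),
// which is indices 2 and 3 when counting from 0.
// Array.from splits by code point, so emoji count as one character.
function hasReservedHyphens(label) {
  const chars = Array.from(label);
  return chars[2] === '-' && chars[3] === '-';
}
```

Here `00--` and `xn--a` are flagged, while `-0-` and `-00-` are not.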

I took the remaining unordered scripts and restricted most of them based on the purity of their registrations. The following scripts have considerable registrations, where the second row has nearly all pure registrations:

Hang, Hani, Kana, Hira, Latn, Cyrl
Arab, Deva, Hebr, Grek, Thai, Beng, Taml

I decided to make the 2nd row script-restricted as well. This greatly simplifies the remaining ordering problem, as Latn/Cyrl are pretty distinct from CJK and I can split them based on Hira/Kana vs Hang.

Although it’s probably hard to sample, I’d imagine some scripts would prefer if their names were pure and unmixable.

Based on registrations, I allow the CJK to be romanized and allow access to unadorned Latin a-z.

There are a lot of dot characters but there are at least 4 “middle dot” dots:

0xB7, // (·) MIDDLE DOT (Common) 
0x387, // (·) GREEK ANO TELEIA (Common) => IDNA mapped to B7
0x30FB, // (・) KATAKANA MIDDLE DOT (Common)
0xFF65, // (･) HALFWIDTH KATAKANA MIDDLE DOT => IDNA mapped to 30FB

ContextO (the IDNA 2008 contextual rule for this character) says Middle dot is only allowed between two L’s, l·l (there are also dedicated characters for this Ŀ and ŀ). Only 1 name in ENS uses this form, and it’s by accident (al·la·huak·bar).
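For a sense of the code involved, the context rule described above could look roughly like this. It is a sketch of the "between two l's" condition only, with a hypothetical function name, and it ignores everything else a full validator would do.

```javascript
const MIDDLE_DOT = 0xB7; // (·) MIDDLE DOT
const SMALL_L = 0x6C;    // (l) LATIN SMALL LETTER L

// True if every U+00B7 in the label has a Latin small "l" immediately
// before and immediately after it. Out-of-range neighbours are undefined
// and therefore never match.
function middleDotContextOk(label) {
  const cps = Array.from(label, ch => ch.codePointAt(0));
  return cps.every((cp, i) =>
    cp !== MIDDLE_DOT || (cps[i - 1] === SMALL_L && cps[i + 1] === SMALL_L));
}
```

For example, `l·l` satisfies the rule, while a prefix use such as `·555` does not.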

There are 167 registrations with B7.
There are 49 registrations with 30FB

Although B7 has more registrations, the full-width (・) middle dot 30FB seems like the correct one (if any are allowed). Maybe it could use the same rules as apostrophe? Most of the registrations that use it are using it with pure Japanese ペガサス・j・クロフォード. About half the registrations of B7 are Latin or digits (as a prefix: ·555) and the other half are used correctly (豚林·vitalik).

I was intending to enforce that context rule but I think it’s better to just disallow B7 (it’s not worth the code complexity.)

raffy · October 24, 2022, 11:30am

I pushed a pretty large update. Resolver | Characters | Emoji

Preliminary error report:
ens_normalize (1.7.0) vs eth-ens-namehash (2.0.15) [1528752 labels] @ 2022-10-24T10:48:41.586Z (4MB)
(The error report code needs to be improved. It should probably be grouped by error type and split by script.)

I added a new feature ens_split(name): Label[] which basically produces a JSON structure of the information in the screenshot below:

JSON Description

[
  {
    input: [ 49, 65039, 8419, 82, 97, 771, 102, 102, 121, 128169 ],
    offset: 0,
    mapped: [ [ 49, 65039, 8419 ], 114, 97, 771, 102, 102, 121, [ 128169, 65039 ] ],
    output: [ 49, 8419, 114, 227, 102, 102, 121, 128169 ],
    emoji: true,
    script: 'Latin'
  },
  {
    input: [ 101, 116, 104 ],
    offset: 11,
    mapped: [ 101, 116, 104 ],
    output: [ 101, 116, 104 ],
    emoji: false,
    script: 'Latin'
  }
]
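As a quick way to read that structure: the `output` arrays are Unicode code points, so the normalized name can be rebuilt from them. The sketch below copies only the two `output` arrays from the listing above; `joinOutput` is an illustrative helper, not part of the library.

```javascript
// The `output` fields copied from the structure above.
const labels = [
  { output: [49, 8419, 114, 227, 102, 102, 121, 128169] },
  { output: [101, 116, 104] },
];

// Convert each label's code points to a string, then join labels with ".".
function joinOutput(labels) {
  return labels.map(l => String.fromCodePoint(...l.output)).join('.');
}

const normalized = joinOutput(labels);
```

Comparing `input` with `output` for the first label shows what normalization did: the emoji presentation selector 65039 was dropped, "R" (82) became "r" (114), and "a" plus combining tilde (97, 771) was composed into "ã" (227).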

Octexor · October 25, 2022, 12:48am

Hey guys, thanks for all the hard work you are doing.
I’m a dev at ens.vision and I have two questions.

We have the Persian 999 club on our website. After the normalization update 973 of the 1000 three-digit names will be invalid and not resolvable. We already advised the Persian community not to renew their names. My question on their behalf is, will they receive a refund?

Also yesterday we listed new clubs of negative numbers after high demand from the community.
These also include negative Arabic numbers (Negative Arabic 99, Negative Arabic 999 and Negative Arabic 10k)
@raffy
Today we noticed that the latest update to the normalization code, which we are already using, makes these numbers invalid as well (Error: mixed-script Arabic confusable: "-"). Is there a possibility to allow leading and trailing hyphen for Arabic numbers? Because that’s how negative Arabic numbers are written. (Whether it should be leading or trailing is another topic)

Thanks in advance

serenae · October 25, 2022, 1:06am

Only the core team would be able to speak authoritatively about refunds, so you’ll likely need to wait until they make some sort of announcement.

My guess would be that if the name was not able to be registered on the official ENS manager app/site in the first place, then no refund.

Octexor · October 25, 2022, 1:10am

Alright, that would be great.
Persian numbers which will be invalid later can be registered currently.
Thanks.

raffy · October 25, 2022, 4:28am

My question is: was this the best outcome? Should we map all the extended digits for consistency? Or disallow them all instead? From earlier discussion – (oh that was you Octexor), we decided to map just the ones that were pixel identical.

Yes, I applied this change. Please let me know if there are other issues.

Octexor · October 25, 2022, 8:15am

Yes, that’s the best we can do. Any other method would confuse the users. Because most Persians don’t use the Arabic ٤٥٦ digits, if they have a Persian keyboard.
We will update our Persian club with the new names soon. 34% of the numbers will be shared between Arabic and Persian, which is fine. Just like many words that are shared.

Nice, thank you!

There are some numbers that are still invalid.
The ones that only contain ١ or ٥.
١١١ ١١٥ ١٥١ ١٥٥ ٥١١ ٥١٥ ٥٥١ ٥٥٥
Same is true for Persian:
۱۱۱ ۱۱۵ ۱۵۱ ۱۵۵ ۵۱۱ ۵۱۵ ۵۵۱ ۵۵۵
The error is: whole script Arabic confusable
What is the reasoning behind it?

nick.eth · October 25, 2022, 11:22pm

The intention at present is to propose to the DAO to send refunds to anyone whose name was valid under the current normalisation scheme but not under the new one.

raffy · October 26, 2022, 5:58am

A whole script confusable is where the entire label is composed of characters that can look like another label using a different script. Based on ordering, Latin got priority: eg. ٥٥٥ vs ooo etc.

Although similar, I agree the following use different scale and baseline in various fonts and platforms so I’ve removed them from the whole-script confusable list:

661 (١) ARABIC-INDIC DIGIT ONE
665 (٥) ARABIC-INDIC DIGIT FIVE
6F5 (۵) EXTENDED ARABIC-INDIC DIGIT FIVE
967 (१) DEVANAGARI DIGIT ONE
966 (०) DEVANAGARI DIGIT ZERO

However, 966 and 665 are kinda pushing it. Input would be helpful.

Many of these are subjective. I’ve been using this list but I think many are too strict and there are a lot of things missing.

Another example is 4E00 vs 30FC. There are 4000 and 500 registrations respectively. I was thinking about disallowing 4E00 if Japanese (contains a Kana/Hira character).
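A sketch of what that idea could look like, using Unicode script property escapes. Nothing here is decided policy; `rejects4E00InJapanese` is a hypothetical name, and "Japanese" is approximated exactly as stated above, by the presence of any Hiragana or Katakana character.

```javascript
// Matches any character whose Script property is Hiragana or Katakana.
// Note that U+30FC itself is Script=Common, so it does not match here.
const KANA_OR_HIRA = /[\p{Script=Hiragana}\p{Script=Katakana}]/u;

// True if the label contains U+4E00 alongside at least one
// Hiragana or Katakana character.
function rejects4E00InJapanese(label) {
  return label.includes('\u4E00') && KANA_OR_HIRA.test(label);
}
```

Under this sketch, カ一ド (with 4E00) would be rejected, カード (with 30FC) and 一二三 (pure Han) would not.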

Theth.eth · October 27, 2022, 12:11am

Is there an easy way to allow 665 & 966 only if the other characters are of the same script?

Yes, 665 & 966 might be confusable, but the rest of the characters used would correct it

I do know that it can then only produce a few names that could be confusable (matching numbers), would that be acceptable, I don’t know

raffy · October 27, 2022, 3:01am

Yes, but you’d lose the repeated cases: ٥٥٥.eth and ०००.eth

Theth.eth · October 28, 2022, 12:59am

If it happens, it happens, might be the easiest/simplest/best way

cthulu.eth · October 29, 2022, 2:27pm

@raffy It appears I’ve found two confusable names, both of which check out in the resolver tool:

ENS Name	Details		Unicode	Links
السعودية.eth	Uses an arabic yeh:	ي	`U+064A`	[Resolver tool] [Unicode analyzer]
السعودیة.eth	Uses a farsi yeh:	ی	`U+06CC`	[Resolver tool] [Unicode analyzer]

I think the issue stems from the fact that alone the arabic and farsi yeh have slight differences, but in at least some words those differences seem to disappear.

raffy · October 30, 2022, 4:01am

Single-script confusables have the same problem as script ordering does: since we have a global namespace, the only way to resolve these situations is to say character A > character B (ie. order them) and disallow the alternatives.

These are the valid single-script confusables for Arabic

// Single: [30B]
0x64B, // (◌ً) ARABIC FATHATAN
0x8F0, // (◌ࣰ) ARABIC OPEN FATHATAN

// Single: [307]
0x6EC, // (◌۬) ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
0x8EA, // (◌࣪) ARABIC TONE ONE DOT ABOVE

// Single: [350]
0x8FF, // (◌ࣿ) ARABIC MARK SIDEWAYS NOON GHUNNA
0x8F8, // (◌ࣸ) ARABIC RIGHT ARROWHEAD ABOVE

// Single: [64C]
0x8F1, // (◌ࣱ) ARABIC OPEN DAMMATAN
0x8E8, // (◌ࣨ) ARABIC CURLY DAMMATAN
0x8E5, // (◌ࣥ) ARABIC CURLY DAMMA

// Single: [6C]
0x661, // (١) ARABIC-INDIC DIGIT ONE
0x627, // (ا) ARABIC LETTER ALEF

// Single: [6F]
0x665, // (٥) ARABIC-INDIC DIGIT FIVE
0x6F5, // (۵) EXTENDED ARABIC-INDIC DIGIT FIVE
0x647, // (ه) ARABIC LETTER HEH
0x6BE, // (ھ) ARABIC LETTER HEH DOACHASHMEE
0x6C1, // (ہ) ARABIC LETTER HEH GOAL
0x6D5, // (ە) ARABIC LETTER AE

// Single: [754]
0x8A9, // (ࢩ) ARABIC LETTER YEH WITH TWO DOTS BELOW AND DOT ABOVE
0x767, // (ݧ) ARABIC LETTER NOON WITH TWO DOTS BELOW

// Single: [62D 654]
0x681, // (ځ) ARABIC LETTER HAH WITH HAMZA ABOVE
0x772, // (ݲ) ARABIC LETTER HAH WITH SMALL ARABIC LETTER TAH ABOVE

// Single: [6A1]
0x8BB, // (ࢻ) ARABIC LETTER AFRICAN FEH
0x8BC, // (ࢼ) ARABIC LETTER AFRICAN QAF

// Single: [6A1 6DB]
0x6A4, // (ڤ) ARABIC LETTER VEH
0x6A8, // (ڨ) ARABIC LETTER QAF WITH THREE DOTS ABOVE

// Single: [643]
0x6A9, // (ک) ARABIC LETTER KEHEH
0x6AA, // (ڪ) ARABIC LETTER SWASH KAF

// Single: [643 6DB]
0x6AD, // (ڭ) ARABIC LETTER NG
0x763, // (ݣ) ARABIC LETTER KEHEH WITH THREE DOTS ABOVE

// Single: [649]
0x6BA, // (ں) ARABIC LETTER NOON GHUNNA
0x8BD, // (ࢽ) ARABIC LETTER AFRICAN NOON
0x64A, // (ي) ARABIC LETTER YEH
0x6CC, // (ی) ARABIC LETTER FARSI YEH
0x6D2, // (ے) ARABIC LETTER YEH BARREE

// Single: [649 615]
0x679, // (ٹ) ARABIC LETTER TTEH
0x6BB, // (ڻ) ARABIC LETTER RNOON

// Single: [649 6DB]
0x67E, // (پ) ARABIC LETTER PEH
0x62B, // (ث) ARABIC LETTER THEH
0x6BD, // (ڽ) ARABIC LETTER NOON WITH THREE DOTS ABOVE
0x6D1, // (ۑ) ARABIC LETTER YEH WITH THREE DOTS BELOW
0x63F, // (ؿ) ARABIC LETTER FARSI YEH WITH THREE DOTS ABOVE

// Single: [649 306]
0x756, // (ݖ) ARABIC LETTER BEH WITH SMALL V
0x6CE, // (ێ) ARABIC LETTER YEH WITH SMALL V

For an example, here are 6F confusables (which are the symbols that look like Latin “o”) for Arabic (from the above spoiler):

0x665, // (٥) ARABIC-INDIC DIGIT FIVE
0x6F5, // (۵) EXTENDED ARABIC-INDIC DIGIT FIVE
0x647, // (ه) ARABIC LETTER HEH
0x6BE, // (ھ) ARABIC LETTER HEH DOACHASHMEE
0x6C1, // (ہ) ARABIC LETTER HEH GOAL
0x6D5, // (ە) ARABIC LETTER AE

We already have 6F5 mapped to 665. Of the remaining symbols, which one is preferred or which ones aren’t actually confusing?

I would say 665 is visually different from 6BE. 647/6C1/6D5 look the same but are visually distinct from 665 and 6BE. Only 1 of those 3 would be allowed but I don’t know which one.

647 has 753 registrations
6C1 has 12
6D5 has 1

This makes me think that 6C1 and 6D5 should be disallowed.

For your example, these are Arabic confusables for 649 (which is also Arabic)

0x649, // (ى) ARABIC LETTER ALEF MAKSURA <= Confusable Primary
0x6BA, // (ں) ARABIC LETTER NOON GHUNNA
0x8BD, // (ࢽ) ARABIC LETTER AFRICAN NOON
0x64A, // (ي) ARABIC LETTER YEH
0x6CC, // (ی) ARABIC LETTER FARSI YEH
0x6D2, // (ے) ARABIC LETTER YEH BARREE

649 has 108 registrations
6BA has 0
8BD has 0
64A has 3171
6CC has 122
6D2 has 2

This looks like 3 separate characters to me:

649, 64A, 6CC (64A has dots and 649/6CC do not)
6BA, 8BD
6D2

From this, I would disallow: 649 and 6CC based on registrations. I don’t know how to choose between 6BA and 8BD (both 0 regs).

If those dots make 64A distinct, I would keep 649.

As a potential solution: for each script, I could compute a report of the single-script confusable groups along with their registration counts. I would need users of those scripts to discern if any of those groups should be broken up further, and for each remaining group with 2+ characters, which is the preferred character. The end result is simple: non-primary single-script confusables must be disallowed.

Just to clarify: imagine the name “XY” where X and Y are single-script confusable. If you simply enforce both X and Y can’t be used together, then “XX” and “YY” would be valid but since confusable means X looks like Y, that also means “XX” looks like “YY”, thus only one must be allowed (unless they weren’t confusable in the first place.)
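Roughly, the end result could be sketched like this. The group and counts are the 647/6C1/6D5 figures quoted above; picking the primary by registration count is only an assumption for the sketch (the proposal is for users of the script to choose), and all function names are hypothetical.

```javascript
// One single-script confusable group with its registration counts.
const group = [
  { cp: 0x647, regs: 753 }, // ARABIC LETTER HEH
  { cp: 0x6C1, regs: 12 },  // ARABIC LETTER HEH GOAL
  { cp: 0x6D5, regs: 1 },   // ARABIC LETTER AE
];

// Assumption: the most-registered character becomes the primary.
function pickPrimary(group) {
  return group.reduce((best, c) => (c.regs > best.regs ? c : best)).cp;
}

// Every non-primary member of the group is disallowed outright.
function disallowedSet(group) {
  const primary = pickPrimary(group);
  return new Set(group.map(c => c.cp).filter(cp => cp !== primary));
}

// A label is allowed only if it contains no disallowed code point.
function labelAllowed(label, disallowed) {
  return Array.from(label).every(ch => !disallowed.has(ch.codePointAt(0)));
}
```

This is the "XX vs YY" point in code: because the non-primary characters are removed from the valid set entirely, a label made only of 6C1 is rejected even though it never mixes with 647.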

raffy · November 2, 2022, 2:56am

Quick update:

I was able to add a bunch of the single-script confusables myself using registrations as a guide. I will provide a list of the characters that I need help resolving.
I should have updated error reports very soon.
I’ve changed “fraction slash” (½.eth) to function like Apostrophe (can’t lead, trail, or touch) – but maybe it should just be disabled? It is legal in UTS-46 but clearly looks like / when not touching digits (or with a non-supporting font).
New Tool: Recent 1000 ENS Registrations w/r/t/ Normalization
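A sketch of the apostrophe-style placement rule applied to fraction slash. This reads "touch" as "sit next to another fraction slash", which is an assumption on my part, and `fractionSlashOk` is a made-up name.

```javascript
const FRACTION_SLASH = '\u2044'; // (⁄) FRACTION SLASH

// The character may not lead, trail, or be directly followed by another
// copy of itself (which also covers being directly preceded by one).
function fractionSlashOk(label) {
  const chars = Array.from(label);
  return chars.every((ch, i) =>
    ch !== FRACTION_SLASH ||
    (i > 0 && i < chars.length - 1 && chars[i + 1] !== FRACTION_SLASH));
}
```

So `1⁄2` passes, while `⁄2`, `1⁄` and `1⁄⁄2` fail.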

I discovered an issue when updating the ENSIP docs to match my implementation. I am working on a solution. Sorry for the delay.