ENS Name Normalization 2nd

Sorry, it’s a programming thing (0-vs-1 based indexing). It’s always been 3rd and 4th characters (think punycode, xn--a...).


I took the remaining unordered scripts and restricted most of them based on the purity of their registrations. The following scripts have considerable registrations, where the second row has nearly all pure registrations:

  1. Hang, Hani, Kana, Hira, Latn, Cyrl
  2. Arab, Deva, Hebr, Grek, Thai, Beng, Taml

I decided to make the 2nd row script-restricted as well. This greatly simplifies the remaining ordering problem, as Latn/Cyrl are pretty distinct from CJK and I can split them based on Hira/Kana vs Hang.
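A minimal sketch of what "script-restricted" could look like in code; this is not the actual ens_normalize logic. It assumes that a label containing any second-row script may not contain letters from any other listed script, and that characters belonging to none of the listed scripts (ASCII digits, for example) are ignored. `scriptsOf` and `violatesScriptRestriction` are hypothetical names; the detection relies on JavaScript's Unicode property escapes.

```javascript
// Second-row scripts from the list above (assumed to be script-restricted).
const RESTRICTED = ["Arabic", "Devanagari", "Hebrew", "Greek", "Thai", "Bengali", "Tamil"];
// First-row scripts, which still need an ordering between them.
const UNRESTRICTED = ["Hangul", "Han", "Katakana", "Hiragana", "Latin", "Cyrillic"];

const MATCHERS = [...RESTRICTED, ...UNRESTRICTED].map(
  (name) => [name, new RegExp(`\\p{Script=${name}}`, "u")]
);

// Collect the listed scripts that occur in a label.
function scriptsOf(label) {
  const found = new Set();
  for (const ch of label) {
    for (const [name, re] of MATCHERS) {
      if (re.test(ch)) found.add(name);
    }
  }
  return found;
}

// True if a restricted script is mixed with any other listed script.
function violatesScriptRestriction(label) {
  const found = scriptsOf(label);
  return found.size > 1 && RESTRICTED.some((s) => found.has(s));
}

console.log(violatesScriptRestriction("\u03B1\u03B2\u03B3")); // pure Greek
console.log(violatesScriptRestriction("\u03B1bc"));           // Greek + Latin
```

Latin mixed with Cyrillic is not flagged here, since neither is in the restricted row; that case is left to the ordering problem.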

Although it’s probably hard to sample, I’d imagine users of some scripts would prefer that their names be pure and unmixable.


Based on registrations, I allow the CJK to be romanized and allow access to unadorned Latin a-z.


There are a lot of dot characters but there are at least 4 “middle dot” dots:

0xB7, // (·) MIDDLE DOT (Common) 
0x387, // (·) GREEK ANO TELEIA (Common) => IDNA mapped to B7
0x30FB, // (・) KATAKANA MIDDLE DOT (Common)
0xFF65, // (・) HALFWIDTH KATAKANA MIDDLE DOT => IDNA mapped to 30FB

ContextO (RFC 5892) says middle dot is only allowed between two l’s, as in l·l (there are also dedicated characters for this: Ŀ and ŀ). Only 1 name in ENS uses this form, and it’s by accident (al·la·huak·bar).

There are 167 registrations with B7.
There are 49 registrations with 30FB.

Although B7 has more registrations, the full-width (・) middle dot 30FB seems like the correct one (if any are allowed). Maybe it could use the same rules as apostrophe? Most of the registrations that use it are using it with pure Japanese, e.g. ペガサス・j・クロフォード. About half the registrations of B7 are Latin or digits (as a prefix: ·555) and the other half are used correctly (豚林·vitalik).

I was intending to enforce ContextO but I think it’s better to just disallow B7 (it’s not worth the code complexity).
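For reference, the contextual rule being dropped is small on its own. A sketch, assuming the rule is exactly "U+00B7 must have U+006C (l) on both sides"; `middleDotInContext` is a hypothetical name and not part of ens_normalize:

```javascript
const MIDDLE_DOT = 0xB7;
const SMALL_L = 0x6C;

// True if every U+00B7 in the label sits directly between two "l" characters.
function middleDotInContext(label) {
  const cps = Array.from(label, (ch) => ch.codePointAt(0));
  return cps.every(
    (cp, i) => cp !== MIDDLE_DOT || (cps[i - 1] === SMALL_L && cps[i + 1] === SMALL_L)
  );
}

console.log(middleDotInContext("paral\u00B7lel")); // dot between two l's
console.log(middleDotInContext("\u00B7555"));      // dot used as a prefix
```

The check itself is cheap; the cost is in carrying one more context-dependent character through mapping, confusable handling, and error reporting.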


I pushed a pretty large update. Resolver | Characters | Emoji

Preliminary error report:
ens_normalize (1.7.0) vs eth-ens-namehash (2.0.15) [1528752 labels] @ 2022-10-24T10:48:41.586Z (4MB)
(The error report code needs to be improved. It should probably be grouped by error type and split by script.)

I added a new feature ens_split(name): Label[] which basically produces a JSON structure of the information in the screenshot below:

JSON Description
```
[
  {
    input: [ 49, 65039, 8419, 82, 97, 771, 102, 102, 121, 128169 ],
    offset: 0,
    mapped: [ [ 49, 65039, 8419 ], 114, 97, 771, 102, 102, 121, [ 128169, 65039 ] ],
    output: [ 49, 8419, 114, 227, 102, 102, 121, 128169 ],
    emoji: true,
    script: 'Latin'
  },
  {
    input: [ 101, 116, 104 ],
    offset: 11,
    mapped: [ 101, 116, 104 ],
    output: [ 101, 116, 104 ],
    emoji: false,
    script: 'Latin'
  }
]
```
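As a sketch of how this structure might be consumed: each label's `output` is an array of code points, so the normalized name can be rebuilt with `String.fromCodePoint`. The `labels` constant below is just the output above pasted in (with `mapped` omitted for brevity), not a live call to `ens_split`, and `labelsToName` is a hypothetical helper.

```javascript
// Pasted from the example output above; not produced by calling ens_split here.
const labels = [
  {
    input: [49, 65039, 8419, 82, 97, 771, 102, 102, 121, 128169],
    offset: 0,
    output: [49, 8419, 114, 227, 102, 102, 121, 128169],
    emoji: true,
    script: "Latin",
  },
  {
    input: [101, 116, 104],
    offset: 11,
    output: [101, 116, 104],
    emoji: false,
    script: "Latin",
  },
];

// Rebuild the normalized name from the per-label output code points.
function labelsToName(labels) {
  return labels.map((l) => String.fromCodePoint(...l.output)).join(".");
}

console.log(labelsToName(labels));
```

In this example the second label's `offset` (11) equals the first label's input length (10) plus one for the dot, which is consistent with offsets being counted in code points rather than UTF-16 units.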


Hey guys, thanks for all the hard work you are doing.
I’m a dev at ens.vision and I have two questions.

We have the Persian 999 club on our website. After the normalization update 973 of the 1000 three-digit names will be invalid and not resolvable. We already advised the Persian community not to renew their names. My question on their behalf is, will they receive a refund?

Also yesterday we listed new clubs of negative numbers after high demand from the community.
These also include negative Arabic numbers (Negative Arabic 99, Negative Arabic 999 and Negative Arabic 10k)
@raffy Today we noticed that the latest update to the normalization code, which we are already using, makes these numbers invalid as well (Error: mixed-script Arabic confusable: "-"). Is there a possibility of allowing a leading or trailing hyphen for Arabic numbers? That’s how negative Arabic numbers are written. (Whether it should be leading or trailing is another topic.)

Thanks in advance :pray:

Only the core team would be able to speak authoritatively about refunds, so you’ll likely need to wait until they make some sort of announcement.

My guess would be that if the name was not able to be registered on the official ENS manager app/site in the first place, then no refund.

Alright, that would be great.
Persian numbers which will be invalid later can be registered currently.
Thanks.

My question is: was this the best outcome? Should we map all the extended digits for consistency? Or disallow them all instead? From earlier discussion – (oh that was you Octexor), we decided to map just the ones that were pixel identical.

Yes, I applied this change. Please let me know if there are other issues.

Yes, that’s the best we can do. Any other method would confuse users, because most Persians who have a Persian keyboard don’t use the Arabic ٤٥٦ digits.
We will update our Persian club with the new names soon. 34% of the numbers will be shared between Arabic and Persian, which is fine. Just like many words that are shared.
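The 34% figure can be checked with a short sketch. It assumes that the seven pixel-identical extended digits (0-3 and 7-9) are mapped from U+06Fx to U+066x, that the Persian 4, 5, and 6 are left alone, and that a Persian three-digit name counts as "shared" when every one of its digits is mapped. `mapDigits` and `countShared` are hypothetical names, not ens_normalize functions.

```javascript
// Assumption: digits 0-3 and 7-9 look identical in both digit sets; 4, 5, 6 do not.
const IDENTICAL_DIGITS = [0, 1, 2, 3, 7, 8, 9];
const DIGIT_MAP = new Map(IDENTICAL_DIGITS.map((d) => [0x6F0 + d, 0x660 + d]));

// Map extended Arabic-Indic digits to Arabic-Indic digits where they look identical.
function mapDigits(label) {
  return Array.from(label, (ch) => {
    const cp = ch.codePointAt(0);
    return String.fromCodePoint(DIGIT_MAP.get(cp) ?? cp);
  }).join("");
}

// Count the Persian names 000-999 in which every digit is mapped.
function countShared() {
  let shared = 0;
  for (let n = 0; n < 1000; n++) {
    const persian = String(n)
      .padStart(3, "0")
      .replace(/\d/g, (d) => String.fromCodePoint(0x6F0 + Number(d)));
    if (Array.from(persian).every((ch) => DIGIT_MAP.has(ch.codePointAt(0)))) shared++;
  }
  return shared;
}

console.log(countShared()); // 343, i.e. 7^3 of 1000 names, about 34%
```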

Nice, thank you!

There are some numbers that are still invalid.
The ones that only contain ١ or ٥.
١١١ ١١٥ ١٥١ ١٥٥ ٥١١ ٥١٥ ٥٥١ ٥٥٥
Same is true for Persian:
۱۱۱ ۱۱۵ ۱۵۱ ۱۵۵ ۵۱۱ ۵۱۵ ۵۵۱ ۵۵۵
The error is: whole script Arabic confusable
What is the reasoning behind it?


The intention at present is to propose to the DAO to send refunds to anyone whose name was valid under the current normalisation scheme but not under the new one.


A whole-script confusable is where the entire label is composed of characters that can look like another label using a different script. Based on ordering, Latin got priority, e.g. ٥٥٥ vs ooo.

Although similar, I agree the following use different scale and baseline in various fonts and platforms so I’ve removed them from the whole-script confusable list:

  • 661 (١) ARABIC-INDIC DIGIT ONE
  • 665 (٥) ARABIC-INDIC DIGIT FIVE
  • 6F5 (۵) EXTENDED ARABIC-INDIC DIGIT FIVE
  • 967 (१) DEVANAGARI DIGIT ONE
  • 966 (०) DEVANAGARI DIGIT ZERO

However, 966 and 665 are kinda pushing it. Input would be helpful.
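To make the whole-script check concrete, here is a minimal sketch. The table is hypothetical and tiny (the real confusable data is far larger), and `latinLookalike` is not an ens_normalize function. A label is whole-script confusable only if every character has a Latin lookalike; a single distinctive character is enough to clear it.

```javascript
// Hypothetical table: code point -> the Latin letter it resembles.
const looksLikeLatin = new Map([
  [0x661, "l"], // ARABIC-INDIC DIGIT ONE
  [0x665, "o"], // ARABIC-INDIC DIGIT FIVE
  [0x6F5, "o"], // EXTENDED ARABIC-INDIC DIGIT FIVE
]);

// Returns the Latin string the label could be mistaken for, or null if any
// character has no Latin lookalike in the table.
function latinLookalike(label, table) {
  let out = "";
  for (const ch of label) {
    const latin = table.get(ch.codePointAt(0));
    if (latin === undefined) return null;
    out += latin;
  }
  return out;
}

console.log(latinLookalike("\u0665\u0665\u0665", looksLikeLatin)); // "ooo"
console.log(latinLookalike("\u0665\u0665\u0662", looksLikeLatin)); // null

// Removing the digits from the table, as described above, clears the label.
const relaxed = new Map(looksLikeLatin);
for (const cp of [0x661, 0x665, 0x6F5]) relaxed.delete(cp);
console.log(latinLookalike("\u0665\u0665\u0665", relaxed)); // null
```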


Many of these are subjective. I’ve been using this list but I think many are too strict and there are a lot of things missing.


Another example is 4E00 vs 30FC. There are 4000 and 500 registrations respectively. I was thinking about disallowing 4E00 if Japanese (contains a Kana/Hira character).
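A sketch of that rule, assuming "Japanese" simply means "contains at least one Hiragana or Katakana character". U+4E00 is the ideograph 一 and U+30FC is the prolonged sound mark ー; `hasMisplacedIdeographOne` is a hypothetical name.

```javascript
// Matches any character whose Unicode script is Hiragana or Katakana.
const KANA = /[\p{Script=Hiragana}\p{Script=Katakana}]/u;

// True if the label contains U+4E00 alongside kana, where U+30FC was
// more likely intended.
function hasMisplacedIdeographOne(label) {
  return label.includes("\u4E00") && KANA.test(label);
}

console.log(hasMisplacedIdeographOne("\u4E00\u4E8C\u4E09"));       // Han only
console.log(hasMisplacedIdeographOne("\u30E9\u4E00\u30E1\u30F3")); // kana with U+4E00
console.log(hasMisplacedIdeographOne("\u30E9\u30FC\u30E1\u30F3")); // kana with U+30FC
```

Pure-Han labels keep 4E00, so Chinese names and numerals are unaffected by this check.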


Is there an easy way to allow 665 & 966 only if the other characters are of the same script?

Yes, 665 & 966 might be confusable, but the rest of the characters used would correct it.

I do know that it would then only produce a few names that could be confusable (matching numbers). Whether that would be acceptable, I don’t know.

Yes, but you’d lose the repeated cases: ٥٥٥.eth and ०००.eth

If it happens, it happens, might be the easiest/simplest/best way


@raffy It appears I’ve found two confusable names, both of which check out in the resolver tool:

| ENS Name | Details | Unicode | Links |
|---|---|---|---|
| السعودية.eth | Uses an Arabic yeh: ي | U+064A | [Resolver tool] [Unicode analyzer] |
| السعودیة.eth | Uses a Farsi yeh: ی | U+06CC | [Resolver tool] [Unicode analyzer] |

I think the issue stems from the fact that, on their own, the Arabic and Farsi yeh have slight differences, but in at least some words those differences seem to disappear.


Single-script confusables have the same problem as script ordering does: since we have a global namespace, the only way to resolve these situations is to say character A > character B (ie. order them) and disallow the alternatives.

These are the valid single-script confusables for Arabic
// Single: [30B]
0x64B, // (◌ً) ARABIC FATHATAN
0x8F0, // (◌ࣰ) ARABIC OPEN FATHATAN

// Single: [307]
0x6EC, // (◌۬) ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
0x8EA, // (◌࣪) ARABIC TONE ONE DOT ABOVE

// Single: [350]
0x8FF, // (◌ࣿ) ARABIC MARK SIDEWAYS NOON GHUNNA
0x8F8, // (◌ࣸ) ARABIC RIGHT ARROWHEAD ABOVE

// Single: [64C]
0x8F1, // (◌ࣱ) ARABIC OPEN DAMMATAN
0x8E8, // (◌ࣨ) ARABIC CURLY DAMMATAN
0x8E5, // (◌ࣥ) ARABIC CURLY DAMMA

// Single: [6C]
0x661, // (١) ARABIC-INDIC DIGIT ONE
0x627, // (ا) ARABIC LETTER ALEF

// Single: [6F]
0x665, // (٥) ARABIC-INDIC DIGIT FIVE
0x6F5, // (۵) EXTENDED ARABIC-INDIC DIGIT FIVE
0x647, // (ه) ARABIC LETTER HEH
0x6BE, // (ھ) ARABIC LETTER HEH DOACHASHMEE
0x6C1, // (ہ) ARABIC LETTER HEH GOAL
0x6D5, // (ە) ARABIC LETTER AE

// Single: [754]
0x8A9, // (ࢩ) ARABIC LETTER YEH WITH TWO DOTS BELOW AND DOT ABOVE
0x767, // (ݧ) ARABIC LETTER NOON WITH TWO DOTS BELOW

// Single: [62D 654]
0x681, // (ځ) ARABIC LETTER HAH WITH HAMZA ABOVE
0x772, // (ݲ) ARABIC LETTER HAH WITH SMALL ARABIC LETTER TAH ABOVE

// Single: [6A1]
0x8BB, // (ࢻ) ARABIC LETTER AFRICAN FEH
0x8BC, // (ࢼ) ARABIC LETTER AFRICAN QAF

// Single: [6A1 6DB]
0x6A4, // (ڤ) ARABIC LETTER VEH
0x6A8, // (ڨ) ARABIC LETTER QAF WITH THREE DOTS ABOVE

// Single: [643]
0x6A9, // (ک) ARABIC LETTER KEHEH
0x6AA, // (ڪ) ARABIC LETTER SWASH KAF

// Single: [643 6DB]
0x6AD, // (ڭ) ARABIC LETTER NG
0x763, // (ݣ) ARABIC LETTER KEHEH WITH THREE DOTS ABOVE

// Single: [649]
0x6BA, // (ں) ARABIC LETTER NOON GHUNNA
0x8BD, // (ࢽ) ARABIC LETTER AFRICAN NOON
0x64A, // (ي) ARABIC LETTER YEH
0x6CC, // (ی) ARABIC LETTER FARSI YEH
0x6D2, // (ے) ARABIC LETTER YEH BARREE

// Single: [649 615]
0x679, // (ٹ) ARABIC LETTER TTEH
0x6BB, // (ڻ) ARABIC LETTER RNOON

// Single: [649 6DB]
0x67E, // (پ) ARABIC LETTER PEH
0x62B, // (ث) ARABIC LETTER THEH
0x6BD, // (ڽ) ARABIC LETTER NOON WITH THREE DOTS ABOVE
0x6D1, // (ۑ) ARABIC LETTER YEH WITH THREE DOTS BELOW
0x63F, // (ؿ) ARABIC LETTER FARSI YEH WITH THREE DOTS ABOVE

// Single: [649 306]
0x756, // (ݖ) ARABIC LETTER BEH WITH SMALL V
0x6CE, // (ێ) ARABIC LETTER YEH WITH SMALL V

For example, here are the 6F confusables (the symbols that look like Latin “o”) for Arabic (from the above spoiler):

0x665, // (٥) ARABIC-INDIC DIGIT FIVE
0x6F5, // (۵) EXTENDED ARABIC-INDIC DIGIT FIVE
0x647, // (ه) ARABIC LETTER HEH
0x6BE, // (ھ) ARABIC LETTER HEH DOACHASHMEE
0x6C1, // (ہ) ARABIC LETTER HEH GOAL
0x6D5, // (ە) ARABIC LETTER AE

We already have 6F5 mapped to 665. Of the remaining symbols, which one is preferred or which ones aren’t actually confusing?

I would say 665 is visually different from 6BE. 647/6C1/6D5 look the same but are visually distinct from 665 and 6BE. Only 1 of those 3 would be allowed but I don’t know which one.

  • 647 has 753 registrations
  • 6C1 has 12
  • 6D5 has 1

This makes me think that 6C1 and 6D5 should be disallowed.


For your example, these are Arabic confusables for 649 (which is also Arabic)

0x649, // (ى) ARABIC LETTER ALEF MAKSURA <= Confusable Primary
0x6BA, // (ں) ARABIC LETTER NOON GHUNNA
0x8BD, // (ࢽ) ARABIC LETTER AFRICAN NOON
0x64A, // (ي) ARABIC LETTER YEH
0x6CC, // (ی) ARABIC LETTER FARSI YEH
0x6D2, // (ے) ARABIC LETTER YEH BARREE

  • 649 has 108 registrations
  • 6BA has 0
  • 8BD has 0
  • 64A has 3171
  • 6CC has 122
  • 6D2 has 2

This looks like 3 separate characters to me:

  1. 649, 64A, 6CC (64A has dots and 649/6CC do not)
  2. 6BA, 8BD
  3. 6D2

From this, I would disallow: 649 and 6CC based on registrations. I don’t know how to choose between 6BA and 8BD (both 0 regs).

If those dots make 64A distinct, I would keep 649.


As a potential solution: for each script, I could compute a report of the single-script confusable groups along with their registration counts. I would need users of those scripts to discern if any of those groups should be broken up further, and for each remaining group with 2+ characters, which is the preferred character. The end result is simple: non-primary single-script confusables must be disallowed.

Just to clarify: imagine the name “XY” where X and Y are single-script confusable. If you simply enforce both X and Y can’t be used together, then “XX” and “YY” would be valid but since confusable means X looks like Y, that also means “XX” looks like “YY”, thus only one must be allowed (unless they weren’t confusable in the first place.)
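A sketch of that report-and-resolve step, using the registration counts quoted above. It assumes registration count is the only tie-breaker, which is exactly the part where input from readers of the script is needed; the groupings and the names `choosePrimary` and `isAllowed` are hypothetical.

```javascript
// Confusable groups as [codePoint, registrations] pairs (counts from above).
const hehLike = [[0x647, 753], [0x6C1, 12], [0x6D5, 1]];
const yehLike = [[0x649, 108], [0x64A, 3171], [0x6CC, 122]];

// Pick the most-registered character as primary; the rest are disallowed.
function choosePrimary(group) {
  const sorted = [...group].sort((a, b) => b[1] - a[1]);
  const [primary, ...disallowed] = sorted.map(([cp]) => cp);
  return { primary, disallowed };
}

const disallowedSet = new Set(
  [hehLike, yehLike].flatMap((g) => choosePrimary(g).disallowed)
);

// A label is allowed only if it contains no non-primary confusable, so
// "XX" and "YY" can never both be valid.
function isAllowed(label) {
  return Array.from(label).every((ch) => !disallowedSet.has(ch.codePointAt(0)));
}

console.log(choosePrimary(yehLike).primary.toString(16)); // "64a"
console.log(isAllowed("\u064A\u064A")); // true  (primary yeh)
console.log(isAllowed("\u06CC\u06CC")); // false (non-primary lookalike)
```

Under these counts, 64A and 647 become the primaries, matching the "disallow 649 and 6CC" and "disallow 6C1 and 6D5" leanings above.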


Quick update:

  • I was able to add a bunch of the single-script confusables myself using registrations as a guide. I will provide a list of the characters that I need help resolving.
  • I should have updated error reports very soon.
  • I’ve changed “fraction slash” (½.eth) to function like apostrophe (can’t lead, trail, or touch), but maybe it should just be disabled? It is legal in UTS-46 but clearly looks like / when not touching digits (or with a non-supporting font).
  • New Tool: Recent 1000 ENS Registrations w/r/t/ Normalization
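The apostrophe-style placement rule from the fraction slash bullet can be sketched as follows. It assumes "touch" means two of the same character being adjacent, and that the check runs after mapping (so ½ has already become 1⁄2); `placementOk` is a hypothetical name.

```javascript
const FRACTION_SLASH = 0x2044;

// True if `cp` never leads, trails, or sits next to another `cp` in the label.
function placementOk(label, cp = FRACTION_SLASH) {
  const cps = Array.from(label, (ch) => ch.codePointAt(0));
  return cps.every(
    (c, i) => c !== cp || (i > 0 && i < cps.length - 1 && cps[i + 1] !== cp)
  );
}

console.log(placementOk("1\u20442"));       // interior, single
console.log(placementOk("\u20442"));        // leading
console.log(placementOk("1\u2044\u20442")); // touching
```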

I discovered an issue when updating the ENSIP docs to match my implementation. I am working on a solution. Sorry for the delay.


I think I’ve got my head around the problem. The following is a little scuffed as I’m in the middle of it, but any help on the following problem would be greatly appreciated. Otherwise, I must allow or disallow ALL of these characters.


For example, these 2 Hebrew characters are confusable. I need to pick a subset of them. The purple integer is the number of registered names using that character.
image

It would also be useful to know if those names were valid with ENSIP-1, or approximately normalized with the latest code, or how the character is used, its neighbors, or if its appearance changes when combined with other characters (even duplicates of itself), but it’s difficult to present all this information.

In the config file, this corresponds to the following entry for the “l” confusable:

The first question is: can any of these Hebrew characters stand on their own against other “l”-like characters in other scripts that have been marked as primary(*)? A non-primary confusable must be accompanied by another non-confusable character to be allowed. ASCII is primary by default.

The second question is: of the characters that aren’t primary, which is preferred? some? both? or neither? I use a second annotation allow(*) which enables this case.

This file is a little confusing because it exists before any ENS rules have been applied, so it contains extra stuff. For convenience, I’ve commented out every confusable which corresponds to a disallowed IDNA 2003 character.

Possible answers to the above example would be: “both are bad” / “5D5 is preferred” / “they don’t look confusable to me”, etc.


I don’t know Hebrew. But I think both should be allowed, as long as the name is single-script. Same goes for the other examples in that list, like く or ノ or へ (which would block out a lot of legitimate words if disallowed).

The users of those languages around the world must already be intimately aware of any possible confusables, same as we in the English world are very aware of “I” vs “l” or “m” vs “rn” and so on.

As long as the name is contained to a single script then I think that’s fine.


I agree with this.

May I suggest that names be disqualified from a refund if they:

  • have been traded
  • have been sold for a profit
  • have been used for scams, wallet drains, exploits, etc.
  • are attached to any address that is blacklisted by USDT, USDC, or any other asset due to misuse or unauthorized on-chain actions
    (I know that would probably be a tough list to accumulate)

In this case the refund would go to the current owner; the original owner has already been paid.

These would be extremely difficult to adjudicate.


I’m guessing that they will also only get a refund of the minting price and not what they paid on the secondary market