ENS Name Normalization 2nd

The intention at present is to propose to the DAO to send refunds to anyone whose name was valid under the current normalization scheme but not under the new one.

3 Likes

A whole-script confusable is where the entire label is composed of characters that can look like another label using a different script. Based on ordering, Latin got priority: e.g. ٥٥٥ vs ooo, etc.
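A rough sketch of that check (the map below is a tiny hypothetical excerpt; the real data is derived from Unicode's confusables.txt):

```javascript
// Sketch of a whole-script confusable check against Latin, assuming a
// hypothetical `confusableSkeleton` map from code point to the Latin
// character it resembles (real data would come from confusables.txt).
const confusableSkeleton = new Map([
  [0x665, 0x6F], // ٥ ARABIC-INDIC DIGIT FIVE looks like "o"
  [0x966, 0x6F], // ० DEVANAGARI DIGIT ZERO   looks like "o"
  [0x6F,  0x6F], // "o" trivially maps to itself
]);

// A label is whole-script confusable with Latin when EVERY character
// has a Latin look-alike.
function isWholeScriptConfusableWithLatin(label) {
  return [...label].every(ch => confusableSkeleton.has(ch.codePointAt(0)));
}

isWholeScriptConfusableWithLatin('\u0665\u0665\u0665'); // ٥٥٥ → true ("ooo")
```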

Although similar, I agree the following use different scales and baselines across various fonts and platforms, so I’ve removed them from the whole-script confusable list:

  • 661 (١) ARABIC-INDIC DIGIT ONE
  • 665 (٥) ARABIC-INDIC DIGIT FIVE
  • 6F5 (۵) EXTENDED ARABIC-INDIC DIGIT FIVE
  • 967 (१) DEVANAGARI DIGIT ONE
  • 966 (०) DEVANAGARI DIGIT ZERO

However, 966 and 665 are kinda pushing it. Input would be helpful.

image

Many of these are subjective. I’ve been using this list but I think many are too strict and there are a lot of things missing.


Another example is 4E00 vs 30FC. There are 4000 and 500 registrations respectively. I was thinking about disallowing 4E00 when the label is Japanese (i.e. contains a Kana/Hira character).

image

Is there an easy way to allow 665 & 966 only if the other characters are of the same script?

Yes, 665 & 966 might be confusable, but the rest of the characters used would correct it.

I do know that this would still leave a few names that could be confusable (matching numbers); whether that would be acceptable, I don’t know.

Yes, but you’d lose the repeated cases: ٥٥٥.eth and ०००.eth

If it happens, it happens, might be the easiest/simplest/best way

1 Like

@raffy It appears I’ve found two confusable names, both of which check out in the resolver tool:

  • السعودية.eth: uses an Arabic yeh, ي U+064A [Resolver tool] [Unicode analyzer]
  • السعودیة.eth: uses a Farsi yeh, ی U+06CC [Resolver tool] [Unicode analyzer]

I think the issue stems from the fact that, in isolation, the Arabic and Farsi yeh have slight differences, but in at least some words those differences seem to disappear.

1 Like

Single-script confusables have the same problem as script ordering does: since we have a global namespace, the only way to resolve these situations is to say character A > character B (ie. order them) and disallow the alternatives.

These are the valid single-script confusables for Arabic
// Single: [30B]
0x64B, // (◌ً) ARABIC FATHATAN
0x8F0, // (◌ࣰ) ARABIC OPEN FATHATAN

// Single: [307]
0x6EC, // (◌۬) ARABIC ROUNDED HIGH STOP WITH FILLED CENTRE
0x8EA, // (◌࣪) ARABIC TONE ONE DOT ABOVE

// Single: [350]
0x8FF, // (◌ࣿ) ARABIC MARK SIDEWAYS NOON GHUNNA
0x8F8, // (◌ࣸ) ARABIC RIGHT ARROWHEAD ABOVE

// Single: [64C]
0x8F1, // (◌ࣱ) ARABIC OPEN DAMMATAN
0x8E8, // (◌ࣨ) ARABIC CURLY DAMMATAN
0x8E5, // (◌ࣥ) ARABIC CURLY DAMMA

// Single: [6C]
0x661, // (١) ARABIC-INDIC DIGIT ONE
0x627, // (ا) ARABIC LETTER ALEF

// Single: [6F]
0x665, // (٥) ARABIC-INDIC DIGIT FIVE
0x6F5, // (۵) EXTENDED ARABIC-INDIC DIGIT FIVE
0x647, // (ه) ARABIC LETTER HEH
0x6BE, // (ھ) ARABIC LETTER HEH DOACHASHMEE
0x6C1, // (ہ) ARABIC LETTER HEH GOAL
0x6D5, // (ە) ARABIC LETTER AE

// Single: [754]
0x8A9, // (ࢩ) ARABIC LETTER YEH WITH TWO DOTS BELOW AND DOT ABOVE
0x767, // (ݧ) ARABIC LETTER NOON WITH TWO DOTS BELOW

// Single: [62D 654]
0x681, // (ځ) ARABIC LETTER HAH WITH HAMZA ABOVE
0x772, // (ݲ) ARABIC LETTER HAH WITH SMALL ARABIC LETTER TAH ABOVE

// Single: [6A1]
0x8BB, // (ࢻ) ARABIC LETTER AFRICAN FEH
0x8BC, // (ࢼ) ARABIC LETTER AFRICAN QAF

// Single: [6A1 6DB]
0x6A4, // (ڤ) ARABIC LETTER VEH
0x6A8, // (ڨ) ARABIC LETTER QAF WITH THREE DOTS ABOVE

// Single: [643]
0x6A9, // (ک) ARABIC LETTER KEHEH
0x6AA, // (ڪ) ARABIC LETTER SWASH KAF

// Single: [643 6DB]
0x6AD, // (ڭ) ARABIC LETTER NG
0x763, // (ݣ) ARABIC LETTER KEHEH WITH THREE DOTS ABOVE

// Single: [649]
0x6BA, // (ں) ARABIC LETTER NOON GHUNNA
0x8BD, // (ࢽ) ARABIC LETTER AFRICAN NOON
0x64A, // (ي) ARABIC LETTER YEH
0x6CC, // (ی) ARABIC LETTER FARSI YEH
0x6D2, // (ے) ARABIC LETTER YEH BARREE

// Single: [649 615]
0x679, // (ٹ) ARABIC LETTER TTEH
0x6BB, // (ڻ) ARABIC LETTER RNOON

// Single: [649 6DB]
0x67E, // (پ) ARABIC LETTER PEH
0x62B, // (ث) ARABIC LETTER THEH
0x6BD, // (ڽ) ARABIC LETTER NOON WITH THREE DOTS ABOVE
0x6D1, // (ۑ) ARABIC LETTER YEH WITH THREE DOTS BELOW
0x63F, // (ؿ) ARABIC LETTER FARSI YEH WITH THREE DOTS ABOVE

// Single: [649 306]
0x756, // (ݖ) ARABIC LETTER BEH WITH SMALL V
0x6CE, // (ێ) ARABIC LETTER YEH WITH SMALL V

For an example, here are the 6F confusables (the symbols that look like Latin “o”) for Arabic, from the list above:

0x665, // (٥) ARABIC-INDIC DIGIT FIVE
0x6F5, // (۵) EXTENDED ARABIC-INDIC DIGIT FIVE
0x647, // (ه) ARABIC LETTER HEH
0x6BE, // (ھ) ARABIC LETTER HEH DOACHASHMEE
0x6C1, // (ہ) ARABIC LETTER HEH GOAL
0x6D5, // (ە) ARABIC LETTER AE

We already have 6F5 mapped to 665. Of the remaining symbols, which one is preferred or which ones aren’t actually confusing?

I would say 665 is visually different from 6BE. 647/6C1/6D5 look the same as each other but are visually distinct from 665 and 6BE. Only 1 of those 3 would be allowed but I don’t know which one.

  • 647 has 753 registrations
  • 6C1 has 12
  • 6D5 has 1

This makes me think that 6C1 and 6D5 should be disallowed.
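Using registration counts to pick the primary could be sketched as follows (the helper is hypothetical; counts are the ones listed above):

```javascript
// Sketch: pick the primary of a single-script confusable group by
// registration count and disallow the rest. Counts from the list above.
function resolveByRegistrations(group) {
  const sorted = [...group].sort((a, b) => b.regs - a.regs);
  return { primary: sorted[0].cp, disallowed: sorted.slice(1).map(x => x.cp) };
}

resolveByRegistrations([
  { cp: 0x647, regs: 753 }, // ه HEH
  { cp: 0x6C1, regs: 12 },  // ہ HEH GOAL
  { cp: 0x6D5, regs: 1 },   // ە AE
]);
// → primary 0x647; disallowed [0x6C1, 0x6D5]
```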


For your example, these are Arabic confusables for 649 (which is also Arabic)

0x649, // (ى) ARABIC LETTER ALEF MAKSURA <= Confusable Primary
0x6BA, // (ں) ARABIC LETTER NOON GHUNNA
0x8BD, // (ࢽ) ARABIC LETTER AFRICAN NOON
0x64A, // (ي) ARABIC LETTER YEH
0x6CC, // (ی) ARABIC LETTER FARSI YEH
0x6D2, // (ے) ARABIC LETTER YEH BARREE

  • 649 has 108 registrations
  • 6BA has 0
  • 8BD has 0
  • 64A has 3171
  • 6CC has 122
  • 6D2 has 2

This looks like 3 visually-distinct shapes to me:

  1. 649, 64A, 6CC (64A has dots and 649/6CC do not)
  2. 6BA, 8BD
  3. 6D2

From this, I would disallow: 649 and 6CC based on registrations. I don’t know how to choose between 6BA and 8BD (both 0 regs).

If those dots make 64A distinct, I would keep 649.


As a potential solution: for each script, I could compute a report of the single-script confusable groups along with their registration counts. I would need users of those scripts to discern if any of those groups should be broken up further, and for each remaining group with 2+ characters, which is the preferred character. The end result is simple: non-primary single-script confusables must be disallowed.

Just to clarify: imagine the name “XY” where X and Y are single-script confusable. If you simply enforce that X and Y can’t be used together, then “XX” and “YY” would both be valid; but since confusable means X looks like Y, that also means “XX” looks like “YY”, thus only one must be allowed (unless they weren’t confusable in the first place.)
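A toy skeleton map demonstrates why banning only the mixture is insufficient (X/Y stand in for any confusable pair; the map is illustrative):

```javascript
// X and Y stand in for two single-script confusable characters.
// Mapping each character to its shared "skeleton" shows that "XX"
// and "YY" collide even though neither label mixes X with Y.
const skeleton = { X: 's', Y: 's' };

const toSkeleton = label => [...label].map(ch => skeleton[ch] ?? ch).join('');

toSkeleton('XX') === toSkeleton('YY'); // true: "XX" still spoofs "YY"
```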

3 Likes

Quick update:

  • I was able to add a bunch of the single-script confusables myself using registrations as a guide. I will provide a list of the characters that I need help resolving.
  • I should have updated error reports available very soon.
  • I’ve changed “fraction slash” (½.eth) to function like the apostrophe (it can’t lead, trail, or touch another one) – but maybe it should just be disabled? It is legal in UTS-46 but clearly looks like / when not touching digits (or with a non-supporting font).
  • New Tool: Recent 1000 ENS Registrations w/r/t/ Normalization

I discovered an issue when updating the ENSIP docs to match my implementation. I am working on a solution. Sorry for the delay.

4 Likes

I think I’ve got my head around the problem. The following is a little scuffed as I’m in the middle of it, but any help on the following problem would be greatly appreciated. Otherwise, I must allow or disallow ALL of these characters.


For example, these 2 Hebrew characters are confusable. I need to pick a subset of them. The purple integer is the number of registered names using that character.
image

It would also be useful to know if those names were valid under ENSIP-1, or approximately normalized with the latest code, or how the character is used, its neighbors, or whether its appearance changes when combined with other characters (even duplicates of itself) – but it’s difficult to present all this information.

In the config file, this corresponds to the following entry for the “l” confusable:

The first question is: can any of these Hebrew characters stand on their own against other “l”-like characters in other scripts that have been marked as primary(*)? A non-primary confusable must have another non-confusing character to be allowed. ASCII is primary by default.

The second question is: of the characters that aren’t primary, which is preferred? Some? Both? Or neither? I use a second annotation, allow(*), which enables this case.

This file is a little confusing because it exists before any ENS rules have been applied, so it contains extra stuff. For convenience, I’ve commented out every confusable that corresponds to a disallowed IDNA 2003 character.

Possible answers to the above example would be: “both are bad” / “5D5 is preferred” / “they don’t look confusable to me”, etc.

4 Likes

I don’t know Hebrew. But I think both should be allowed, as long as the name is single-script. Same goes for the other examples in that list, like く or ノ or へ (which would block out a lot of legitimate words if disallowed).

The users of those languages around the world must already be intimately aware of any possible confusables, same as we in the English world are very aware of “I” vs “l” or “m” vs “rn” and so on.

As long as the name is contained to a single script then I think that’s fine.

3 Likes

I agree with this.

May I suggest that names be disqualified from a refund if they:

  • have been traded
  • were sold for a profit
  • have been used for scams, wallet drains, exploits, etc.
  • are attached to any addresses that are blacklisted by USDT, USDC, or any other assets due to misuse or unauthorized on-chain behavior
    (I know that would probably be a tough list to accumulate)
3 Likes

In this case the refund would go to the current owner; the original owner has already been paid.

These would be extremely difficult to adjudicate.

3 Likes

I’m guessing that they will also only get a refund of the minting price and not what they paid on the secondary market

That’s correct.

3 Likes

For now, I’ll have a global setting for any script with more than 1 undecided confusable, and I’ll default to allow. However, I do think many of these characters are confusable and should be reduced to one.


Previously, I ran into two problems while trying to write up my latest changes. I had a problem sharing characters between multiple scripts, and I was getting confused by names of the form AB where A is confusable with C and B is confusable with D but CD isn’t confusable; essentially the name was being flagged against a name that wasn’t possible. I also had the script logic set up in an inefficient way which required all of the esoteric stuff to be tested first.

I discovered these issues while updating the ENSIP document and making some small adjustments based on some input I’ve been getting from other ENS users.

In fixing this issue, I think I greatly simplified the process.

First, there’s the description of the different types of names we want to allow. At the moment, these are script-centric (I’ve been calling them ScriptGroups), but they could be more general. Here are two examples:

{name: 'Latin', test: ['Latn'], rest: ['Zyyy', 'Zinh'], cm: -1, extra: explode_cp('π')}
{name: 'Japanese', test: ['Kana', 'Hira'], rest: ['Hani', 'Zyyy'],  cm: -1, romanize: true}

The way it’s defined is unrelated to what it actually is—it’s simply a set of characters and some rules for which characters can have combining marks.

{name: 'Latin', cps: [61, 62, ...], cm: ['â', ...]}
{name: 'Japanese', cps: [...], cm: [...]}

Second, there’s the IDNA-like part, where you take a string, parse out emoji, remove ignored characters from the remaining sections, apply mappings, and then NFC those segments. This produces a simplified version of ens_tokenize() which has only 2 token types:

[ Emoji("💩"), Emoji("💩"), NFCString("a"), Emoji("💩"), NFCString("bc") ]
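A minimal sketch of that simplified tokenizer (a crude single-code-point emoji test stands in for real emoji-sequence parsing, and the mapping/ignored-character steps are omitted):

```javascript
// Sketch of the simplified tokenization: split a label into Emoji
// tokens and NFC'd text tokens. Real emoji parsing matches full
// ZWJ sequences; this sketch uses a single-code-point stand-in.
const isEmoji = cp => cp >= 0x1F300 && cp <= 0x1FAFF; // crude stand-in

function tokenize(label) {
  const tokens = [];
  let text = '';
  const flush = () => {
    if (text) tokens.push({ type: 'NFCString', value: text.normalize('NFC') });
    text = '';
  };
  for (const ch of label) {
    if (isEmoji(ch.codePointAt(0))) {
      flush();
      tokens.push({ type: 'Emoji', value: ch });
    } else {
      text += ch; // (ignored-character removal and mappings would go here)
    }
  }
  flush();
  return tokens;
}

tokenize('💩💩a💩bc');
// → [Emoji, Emoji, NFCString("a"), Emoji, NFCString("bc")]
```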

You actually don’t need to know if the characters are valid at this point. However, you can precompute the union of the characters in each ScriptGroup (and their decomposed parts) to fail early. You can also precompute non-overlapping sets, which only a unique group can match. From the original group definitions, the characters can also be split into primary/secondary. For example, “a” is primary to Latin and “1” is secondary to Latin (since it’s primary to Common.)

This makes the matching logic very simple:

  1. Given the list of simple tokens, just look at the textual ones.
  2. For each character, see if it belongs to a unique group.
    • If there is more than one unique group, error: you have an illegal mixture.
  3. If you didn’t find a unique group, find all the groups that contain any primary character in the label.
    • If there are no groups, error: you have some weird name that matches no groups.
  4. At this point, you have at least one group.
  5. For each group, check that every character is valid according to that group and apply the combining-mark rules. The first group that matches wins. If no group matches, throw the first error that occurred.
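Those steps might be sketched as follows (the `unique`/`primary`/`valid` group shape is a hypothetical simplification of the real ScriptGroup data, and combining-mark rules are omitted):

```javascript
// Sketch of the matching logic above. Groups are hypothetical objects:
// { name, unique: Set<cp>, primary: Set<cp>, valid: cp => boolean }.
function matchGroup(cps, groups) {
  // Step 2: groups that uniquely own any character in the label.
  const unique = groups.filter(g => cps.some(cp => g.unique.has(cp)));
  if (unique.length > 1) throw new Error('illegal mixture');
  // Step 3: otherwise, groups containing any primary character.
  const candidates = unique.length
    ? unique
    : groups.filter(g => cps.some(cp => g.primary.has(cp)));
  if (!candidates.length) throw new Error('matches no groups');
  // Step 5: the first group where every character is valid wins.
  for (const g of candidates) {
    if (cps.every(cp => g.valid(cp))) return g;
  }
  throw new Error('characters not valid in any matching group');
}

// Toy groups for illustration only:
const latin = {
  name: 'Latin',
  unique: new Set([0x62]),          // pretend "b" exists only in Latin
  primary: new Set([0x61, 0x62]),   // "a", "b"
  valid: cp => cp <= 0x7A,
};
const greek = {
  name: 'Greek',
  unique: new Set([0x3B2]),         // β
  primary: new Set([0x3B1, 0x3B2]), // α, β
  valid: cp => cp >= 0x370,
};
const cps = s => [...s].map(c => c.codePointAt(0));
matchGroup(cps('ab'), [latin, greek]).name; // 'Latin'
```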

Note: there are strings that are made entirely of valid characters but match no groups. There are also strings that match multiple groups (which also may have different combining mark rules.)

With this setup, the order of the groups only matters when you have a name that matches multiple groups (and the choice really doesn’t matter). For example, a digit that’s shared between multiple scripts might fail on a restricted group but pass on a normal group. I’ve set the order of the groups to match the distribution of registrations for efficiency. It’s also easy to extract just 1 group, like Latin, and ignore everything else, which would make a very tiny library.

For marks, I support a few modes: either you can explicitly whitelist compound sequences like “e+◌́+◌̂” (which I’m doing for Latin/Greek/Cyrillic and a few others) or you can specify how many adjacent marks are allowed (when decomposed). The set of available marks is also adjustable per ScriptGroup (most marks are in the Zinh script because they’re shared between multiple scripts, but some script extensions give scripts their own marks.) For example, I can bump Arabic to 2 and Common to 0, while leaving restricted groups at 1.
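The counting mode can be sketched like this (`isMark` is a stand-in for a real Unicode Mark-category check and covers only one mark block):

```javascript
// Sketch of the "max adjacent marks when decomposed" mode.
// `isMark` is a stand-in for a real Unicode Mark-category check.
const isMark = cp => cp >= 0x300 && cp <= 0x36F; // Combining Diacritical Marks only

function checkMaxMarks(label, max) {
  let run = 0;
  for (const ch of label.normalize('NFD')) {
    if (isMark(ch.codePointAt(0))) {
      if (++run > max) return false;
    } else {
      run = 0;
    }
  }
  return true;
}

checkMaxMarks('\u00E9', 1); // é decomposes to e + acute: true
checkMaxMarks('\u1EAB', 1); // ẫ decomposes to a + circumflex + tilde: false
```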

There are also 2 Unicode script adjustments. There are script extensions, which change the singular Char→Script map into a Char→Set&lt;Script&gt; map. For example, some characters are used in exactly N scripts but not in any others (unlike Zinh, which is allowed in all). There are also augmented scripts, which permit separate scripts to mix, like Kana+Hani or Hira+Latin. Lastly, there are edits we might want to make, like letting a restricted script access its currency symbol or allowing a scripted character (the ether symbol is Greek) to be used universally.

And lastly there are confusables, which are sets of different contiguous sequences that are visually indistinguishable. Unicode provides only a 1-character to N-character map, e.g. the single character "ɨ" looks like "i+◌̵" (which is 2 characters). Confusables between different groups are one problem (whole-script confusables). Confusables within the same group are another (single-script confusables). Mixed-script confusables are replaced by the ScriptGroup logic, which allows you to construct a Japanese name from Kana+Hira+Hani+Zyyy+Latn+Emoji but not Latn+Grek.
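That 1-to-N map might look like this (a single illustrative entry; applying it yields a comparable "skeleton"):

```javascript
// Sketch: confusables.txt maps one character to a look-alike sequence,
// e.g. U+0268 "ɨ" maps to "i" + U+0335 COMBINING SHORT STROKE OVERLAY.
const confusableMap = new Map([
  [0x268, 'i\u0335'], // ɨ → i + ◌̵
]);

// Replacing every character by its mapped sequence gives a "skeleton"
// that two confusable strings share.
function skeleton(label) {
  return [...label]
    .map(ch => confusableMap.get(ch.codePointAt(0)) ?? ch)
    .join('');
}

skeleton('\u0268') === 'i\u0335'; // true
```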

This makes the process simple: I just need to generate the correct ScriptGroups according to all these constraints, which allows the simple matching logic to work 100%.


I think I’ve solved the issue. I appreciate the patience.

3 Likes

Confusable Tool (alpha)

Two confusables that together are non-confusable
image

Common “0” + Cyrillic "x"

Yellow rows are the characters in the string. The cells represent the extent of a confusable: they can span groups and there can be multiple characters from the same group. The green header means all the characters in that string are in that group. A framed green header corresponds to the normalization decision. Red headers represent another group that can recreate the input string using at least one confusable swap.

Green cells were decided as non-confusable. Orange cells had no decision and are decided automatically (default allow). Blue cells were decided as confusable but allowed. Red cells were leftover characters in a decided confusable (they still could be valid if they’re the only character in their group.)

After IDNA, normalization, and ENS rules are applied, any cells with the same capital letter (within a confusable) are interchangeable. Bold cells are non-confusable. Gray cells are confusables due to the links between characters and groups. Cells without a letter are disallowed. White cells are unreachable.

Cyan rows are characters that are not confusable. Light-blue rows are characters in the same group as one of the highlighted headers.

I’ve made an attempt to organize the errors so they make sense. I try to fail as early as possible using left-to-right processing.

The distribution of errors is dependent on the sequence of the checks so this isn’t representative of the type of error, just the frequency that an end-user would see it.

The following is a tally from 2M names, taking the prefix of each error (before the first colon).

  11543 disallowed character
   7419 (not an error: different norm)
   1024 illegal mixture
    866 underscore allowed only at start
    349 whole-script confusable
    157 invalid label extension
    100 fenced
     32 too many combining marks
      1 emoji + combining mark

Fenced:
     43 trailing apostrophe
     38 leading apostrophe
     10 leading fraction slash
      4 trailing fraction slash
      2 leading middle dot
      2 adjacent fraction slash + fraction slash
      1 adjacent apostrophe + apostrophe

Assuming I don’t encounter another problem, I should have updated error reports, grouped by error type for easy review, available very soon.

5 Likes

Preliminary error reports by type for easy inspection:

Comparison report


Might be useful: ENS Emoji Frequency

1 Like
  1. 201A (‚) SINGLE LOW-9 QUOTATION MARK should be valid.

It allows run-on digit sequences such as 888888888 to be grouped into a much more readable form: 888‚888‚888

That shouldn’t happen, as it is too visually close to “.”, which would signify a subdomain.

1 Like