ENS Name Normalization

Sorry I’ve been a little slow on updates.

What Octexor said is correct; however, there is a whole-script confusable issue with Arabic digits. In my latest code (not yet released), you can’t have a name fully composed of [0-3,7-9] because all of those digits exist in multiple separate scripts as confusables.

To permit 123.eth, I suggest we map one set of [0-3,7-9] digits to the other, and allow the mapped version in the appropriate scripts. This would also allow (2) versions of 124.eth: ١٢٤ and ١٢۴.

Otherwise, the only pure-digit Arabic names that would be allowed are ones that contain one or more of 4-6 (which determine the script).

Edit: The problem with mapping is that we have to choose one of the scripts.
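
For concreteness, these are the two digit blocks in question (illustrative sketch only, not the actual normalization data):

const ARABIC_INDIC    = Array.from({ length: 10 }, (_, i) => String.fromCodePoint(0x0660 + i)); // ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩
const EXTENDED_ARABIC = Array.from({ length: 10 }, (_, i) => String.fromCodePoint(0x06F0 + i)); // ۰ ۱ ۲ ۳ ۴ ۵ ۶ ۷ ۸ ۹

// Only digits 4, 5 and 6 have visually distinct glyphs between the two blocks;
// 0-3 and 7-9 render identically, which is why a label made only of those digits
// is whole-script confusable between the two scripts.
const VISUALLY_DISTINCT_DIGITS = [4, 5, 6];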

3 Likes

I see the confusion caused by these sets of digits.
And it’s hard to make these decisions for a whole community of users in the Middle East with different languages.

But at least in the case of Persian and Arabic (Urdu, Sindhi and Kurdish are similar), many of the Unicode characters used for the alphabets of these languages are the same (Arabic characters are borrowed for the Persian keyboard and extended). So I’m wondering why these two sets of digits exist when 7 of the 10 glyphs are exactly the same (۰ ۱ ۲ ۳ ۷ ۸ ۹).

Considering the history of these languages and the fact that currently fewer of the [U+06F0 - U+06F9] digits are used in registered names, I think the easiest option would be to map
[U+06F0 - U+06F3, U+06F7 - U+06F9] to [U+0660 - U+0663, U+0667 - U+0669]
and permit mixing the 6 other characters (۴ ٤ ۵ ٥ ۶ ٦) as @raffy described.

I was worried about name lookups (that some of the Persian digits would not be found even though they were entered during registration), but I guess the same normalization has to be done for every name lookup anyway, because of all the other mappings.

The sooner the new normalization is released, the less confusion there will be during mass adoption.

More details about Unicode for Middle Eastern languages can be found here:

Another interesting topic to consider for normalization is the use of diacritics in these languages, which can result in different permutations of each word. I recommend consulting native speakers of each language about that.

@raffy thanks for the awesome work you are doing. I have no idea how you are able to wrap your head around all the possible languages and their glyphs!

3 Likes

I was always under the assumption that the Arabic-Indic digits were the valid, original ones, and the Extended Arabic ones were the invalid ones trying to copy them.

@raffy
I’m not exactly understanding the proposed outcome. What do you mean you can’t have a name fully composed of 0-3, for example? So you cannot have 003 in Arabic but you can have 004? So to have 123.eth, should the non-extended version be the original, and the extended version get a yellow triangle, or be considered a duplicate, or not be allowed?

I think everyone was under the impression that the Arabic-Indic digits were the original and authentic ones, and that the extended versions are technically invalid duplicates that get a yellow caution sign?

I’m not sure how we resolve the issue, but:

  1. ١٢٣ [661 662 663] and ۱۲۳ [6F1 6F2 6F3] look the same (123 vs 123).
  2. ١٢٤ [661 662 664] and ۱۲۴ [6F1 6F2 6F4] look different (124 vs 124).
  3. ١٢٤ [661 662 664] and ۱۲٤ [6F1 6F2 664] look the same (124 vs 124).

If the name is pure digits, then it needs one digit from 4-6 (case 2) + the non-extended/extended exclusion rule (where the second example of case 3 fails) to be valid.

If we want to support pure-digit names with strictly 0-3,7-9 digits (case 1), we have to pick one set. Otherwise, how do you justify case 1 and case 3 being two separate names?

1 Like

Yeah it seems like for [0-3,7-9] the “Extended” digits should map to the regular digits. And then [4-6] can be allowed from both sets since they are distinct.

@474.ETH so if that mapping is used, then yes, you can have 003 in Arabic; the valid name would use the non-extended digits: ٠٠٣.eth

Any use of the extended [0-3,7-9] digits would just normalize to the regular digits. So if someone enters the extended ۰۰۳.eth, it would normalize to the non-extended ٠٠٣.eth, and would resolve records for that non-extended name against the smart contracts.

Even if you did have the extended ۰۰۳.eth registered, it would have invalid metadata, could not be listed on any sites that depend on metadata (OpenSea, etc.), and people would not be able to send money to that name, because their wallet would auto-normalize to the non-extended ٠٠٣.eth and send there instead.

It’s the same thing that happens with capital characters. You could manually register the capital GOD.eth against the smart contracts, but it is not going to be valid and will be essentially useless because all clients will normalize to god.eth first.
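
Here’s a minimal sketch of that flow, assuming the digit mapping proposed above (the lowercase step only stands in for the full UTS-46 mapping, and this is not the released library’s behaviour yet):

// Hypothetical helper: map Extended Arabic-Indic [0-3,7-9] to Arabic-Indic before lookup.
const EXTENDED_TO_ARABIC: Record<string, string> = {
  '\u06F0': '\u0660', '\u06F1': '\u0661', '\u06F2': '\u0662', '\u06F3': '\u0663',
  '\u06F7': '\u0667', '\u06F8': '\u0668', '\u06F9': '\u0669',
  // 4-6 (U+06F4-U+06F6 vs U+0664-U+0666) are intentionally not mapped: they look distinct.
};

function normalizeLabel(label: string): string {
  // toLowerCase() is only a stand-in for the real UTS-46 mapping step.
  return [...label.toLowerCase()].map((ch) => EXTENDED_TO_ARABIC[ch] ?? ch).join('');
}

normalizeLabel('GOD');                 // "god"
normalizeLabel('\u06F0\u06F0\u06F3');  // extended ۰۰۳ -> non-extended ٠٠٣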

4 Likes

Here is how I understood your suggestion, which I think would work:

  • we always do this mapping during normalization:

    • 6F0 → 660
    • 6F1 → 661
    • 6F2 → 662
    • 6F3 → 663
    • 6F7 → 667
    • 6F8 → 668
    • 6F9 → 669
  • the second examples of case 1 and case 3 both normalize to their first examples
    [661 662 663] and [661 662 664]

  • in case 2, the second example normalizes to [661 662 6F4] and will be its own name; the first example remains as-is.

In total you’d get 4 distinct names.
Do you see any other issues here?

2 Likes

Yeah that looks good to me. I agree with this mapping and I think this enables the most possible names and avoids the confusing ones.

4 Likes

Alright, I’m glad that we were able to find a solution for this one.
Btw, I’m a web dev and as you might have guessed my first language is Persian :wink:
Feel free to dm me on twitter if you think I’d be able to help in some way.

2 Likes

This error report has very preliminary whole-script confusables + single-script confusables. I also updated the demo to reflect these changes.

The confusables need some adjustments but it’s better than I expected – it’s actually pretty close to my initial estimate above.

This file has the confusable groups, with their corresponding script sets.

[{"hex":"1041", "form":"၁", "uni":["Cakm","Mymr","Tale"], "ty":"C"},
 {"hex":"1065", "form":"ၥ", "uni":["Mymr"], "ty":"C"}]

(2) new errors:

  1. whole-script confusing means every non-overlapping subsequence has a confusable in another script. Note: this needs to be relaxed to: there exists another single script that has confusables with every non-overlapping subsequence. (A rough sketch of this check follows the list.)
    eg. apple (Latin) vs аррӏе (Cyrillic).

  2. confusing "x" means that there is a same-script confusable.
    eg. a vs ɑ (both Latin).
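
Roughly, the relaxed check in (1) would amount to something like this (simplified to per-character instead of per-subsequence; the data shapes are made up for illustration, not the actual spec format):

type ScriptSet = Set<string>;

// `confusables` maps each character to the set of scripts that contain a look-alike for it.
function isWholeScriptConfusable(chars: string[], ownScript: string, confusables: Map<string, ScriptSet>): boolean {
  if (chars.length === 0) return false;
  // For each character, the *other* scripts that can imitate it.
  const perChar = chars.map((ch) => {
    const scripts = confusables.get(ch) ?? new Set<string>();
    return new Set([...scripts].filter((s) => s !== ownScript));
  });
  // Relaxed rule: there must exist a single other script that can imitate every character.
  let candidates = perChar[0];
  for (const s of perChar.slice(1)) {
    candidates = new Set([...candidates].filter((x) => s.has(x)));
  }
  return candidates.size > 0;
}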

Modifications to the spec:

  • I made ASCII globally unconfusable.
  • Each confusable group is such that if Confuse(a,b) and Confuse(b,c) then Confuse(a,c).
  • For specific scripts, you can choose a default sequence, eg. Cyrillic has 3 e’s: еꬲҽ; I made е the default. (A rough sketch of how these groups and defaults could be represented follows.)
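
To make the last two points concrete, here’s a rough sketch (not the actual implementation) of how the pairs could be merged into groups:

// Merging Confuse(a,b) pairs with union-find automatically yields the transitive closure;
// a default form (eg. е for the Cyrillic e's) would then be chosen per group.
// The pairs below are illustrative only.
function buildConfusableGroups(pairs: [string, string][]): Map<string, string[]> {
  const parent = new Map<string, string>();
  const find = (x: string): string => {
    if (!parent.has(x)) parent.set(x, x);
    const p = parent.get(x)!;
    if (p === x) return x;
    const root = find(p);
    parent.set(x, root); // path compression
    return root;
  };
  for (const [a, b] of pairs) parent.set(find(a), find(b)); // union

  const groups = new Map<string, string[]>();
  for (const x of [...parent.keys()]) {
    const root = find(x);
    groups.set(root, [...(groups.get(root) ?? []), x]);
  }
  return groups;
}

// eg. Confuse(е,ꬲ) and Confuse(ꬲ,ҽ) gives a single group containing е, ꬲ and ҽ.
buildConfusableGroups([['е', 'ꬲ'], ['ꬲ', 'ҽ']]);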

Current issues:

  • You can circumvent some confusables with combining marks by inserting additional marks. eg. Ac (A=letter, c=mark) can be broken with Abc (b=another mark).

  • Many characters need their confusables disabled. eg. ٠٠٠ (Arabic 000) doesn’t work at the moment because it confuses with ꓸ [A4F8]; ١١١ (Arabic 111) confuses with a bunch of things; and ٥۵ (5 and extended-5) confuse with each other according to the spec.

  • Many confusables are missing. eg. the pair ѐ/è isn’t in the Unicode database.

Overall, I think it’s doing the right thing in general, it’s just too strict.

1 Like

Great work!

Some cases I see where confusables should be disabled (as long as it’s single script, I presume):

  • Arabic digits (but with the extended rules as discussed above)
  • Arabic characters (e.g. مُحَمَّد)
  • Persian/Farsi characters (e.g. پرسپولیس)
  • Hindi characters (e.g. इंडिया)
  • Hebrew characters (e.g. מבורך)
  • Armenian characters (e.g. վանկազուրկ)
  • Greek characters (e.g. χωρίςτράπεζα)
  • Cyrillic characters (e.g. русский)
  • Chinese characters (e.g. 黑客松)
  • Japanese characters (e.g. ミュウツー)
    • Including kanji, hiragana, katakana, and the elongation character ー
3 Likes

Can’t we just disable single-script confusables across the board? Presumably if two different but similar characters exist in the same script, it’s for a reason?

Sure, so ignore all same-script confusables?
A name only needs to be single-script and cannot be whole-script confusable?

That’s what I’m suggesting - but it wasn’t a rhetorical question. Are there situations where it makes sense to prohibit a character because it’s confusable with another character in the same script?

RE whole-script confusable, how do we intend to handle these? Will we have to select a ‘canonical’ script for each pairing where this can happen?

1 Like

Even in Latin, there are many: ꭓꭕ, rꭇꭈ, ꜹꜻ, ƫț, ĕě, gɡᶃƍ, etc.
Arabic: ١ (Digit 1) vs ا (Alef)
Myanmar: ၁ၥ
Some have an obvious canonical form; for others it’s not clear.

I’m following the logic here + Script Extensions + Highly Restrictive + Output Confusables (they have to survive normalization: mapping + NFC).

At the moment, ASCII has priority, and everything else always conflicts. My idea above was to use a script ranking to resolve whole-script confusables to always permit one form (unless they have the same rank).

eg. if you have a name that’s covered by two scripts [A,B] and every sequence conflicts with another script [C], you consult Min(Rank(A),Rank(B)) < Rank(C) to decide if you have the “canonical” form.
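
Something like the following, where the specific scripts and ranks are placeholders, not a real proposal:

// Hypothetical ranking, lower = higher priority; the actual scripts/ranks are TBD.
const SCRIPT_RANK: Record<string, number> = { Latn: 0, Cyrl: 1, Grek: 2, Arab: 3 };
const rank = (s: string) => SCRIPT_RANK[s] ?? Number.MAX_SAFE_INTEGER;

// A name covered by scripts `own` (eg. [A,B]) that is whole-script confusable with
// script `other` (eg. C) keeps its form only if Min(Rank(A),Rank(B)) < Rank(C).
function isCanonicalForm(own: string[], other: string): boolean {
  return Math.min(...own.map(rank)) < rank(other);
}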

1 Like

Is it reasonable to expect speakers of those languages to understand the differences? Someone who didn’t know English might ban ‘i’ because it looks a lot like ‘l’ but obviously we would find that problematic.

I’ve stayed out of this because it was generally over my head, and I don’t have a use for non-ASCII characters, but…

Why perform gymnastics to accommodate edge cases? Adhering to the established allowed characters on DNS would appear to save a bunch of headaches and serve as a reliable, well-known standard for any service to easily integrate.

1 Like

The “established allowed characters on DNS” is more complex than it first appears.

DNS itself only supports a subset of ASCII: 0-9, a-z, _ and -. With Punycode it supports a large part of Unicode, and uses UTS-46 normalisation to handle (some degree of) disallowed or duplicative characters. Many registries impose additional limitations on top of that, such as only allowing certain scripts, or disallowing emoji.
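
For example, the Punycode layer looks like this (standard Node library calls, shown purely as an illustration, and assuming a Node build with ICU):

import { domainToASCII, domainToUnicode } from 'node:url';

console.log(domainToASCII('bücher.example'));          // "xn--bcher-kva.example"
console.log(domainToUnicode('xn--bcher-kva.example')); // "bücher.example"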

The question becomes, what set of rules to adopt? There isn’t just one standard, and many of the existing ones have deficiencies as well.

1 Like

Hi everyone, not sure if this is 100% the right place, but I thought I’d flag one more category of alphabet, the Scandi/Nordic ones, which now seem problematic.

I minted a 3-letter hyphenated domain (ø-ø.eth) earlier, together with other similar ones, and everything was just fine until a month or so ago. But now these are marked as malformed.

æ
å
ä
ö
ø

As said, these hyphenated L-L domains were totally fine for weeks, but then changed to invalid.

Would be great to know what’s happening as there’s no point extending registration if they cannot be used.

Three-letter ones without a hyphen, at least with just one Scandi letter, are still fine, eg. løv.eth

Edit:

Further to my message above, it seems that domains with two consecutive Scandi vowels have now also changed to invalid. I really like my øø7.eth, which has now turned useless!

I’m not a technical person myself, but these letters are common on keyboards in the respective countries, some of them are also found in German and other languages, and they also seem to adhere to ISO standards:

From Wikipedia: several different coding standards have existed for this alphabet:

1 Like

I’m Swedish and I can’t think of any real-life use case where you’d use Ä-Ä, Å-Å or Ö-Ö with one single letter. However, this might be an issue for people who want to register hyphenated double surnames.

If a Swedish surname contains an Å, Ä or Ö, it shows as invalid when hyphenated with another surname. I have an Ö in one of my surnames and it’s a nightmare to use it on the internet.

PS: Å, Ä and Ö etc. aren’t A and O with a mark or umlauts; in Swedish they’re their own individual letters.

1 Like

Hello frens,

So I’ve come across the keycap emojis and their visual look. Once registered, the Unicode changes a bit.

Because of this I can’t look them up on OpenSea, nor can I search for them with the original keycap emoji. Besides that, the visual look on Etherscan is completely bugged out because of this. Look for yourself.
[screenshots: bugged40, bugged41]

Is there a chance that you guys can get this fixed?

Kind regards,

N e

1 Like