Please wait until these normalization changes are finalized. Once we have the final code and reports, we’ll know exactly which previously-valid names will have invalid metadata, and the ENS community here will, I’m sure, have a discussion about how to handle these names, under what circumstances refunds may be given, etc.
Please keep conversation on this topic to discussion of the changes to the normalisation function. Off-topic replies, such as requests for refunds or queries about the status of individual names, will be deleted.
This looks to be something specific to how OS is resolving it, not to current ENS name normalization. On LooksRare the same name does not “convert” to English numerals.
Oddly enough, OS has the right ENS name in the window tab’s title:
Names with special characters will still show the non-ASCII warning. This warning is returned from the ENS metadata service: GitHub - ensdomains/ens-metadata-service
In @randomname’s comment above, the first ENS name has Persian zeros prepended to the Arabic digit 6.
The other examples are either Arabic or Persian, and there is nothing wrong with them.
What Octexor said is correct, however there is a whole-script confusable issue with Arabic digits. In my latest code (not yet released), you can’t have a name fully composed of [0-3,7-9] because all those digits exist in multiple separate scripts as confusables.
To permit 123.eth, I suggest we map one set of [0-3,7-9] digits to the other, and allow the mapped version in the appropriate scripts. This would also allow two versions of 124.eth: ١٢٤ and ١٢۴.
Otherwise, the only pure-digit Arabic names that would be allowed are ones that contain one or more of the digits 4-6 (which determine the script).
Edit: The problem with mapping is that we have to choose one of the scripts.
I see the confusion caused by these sets of digits.
And it’s hard to make these decisions for a whole community of users in the middle east with different languages.
But at least in the case of Persian and Arabic (there are also Urdu, Sindhi, and Kurdish, which are similar), I can see that many of the Unicode characters used for the alphabets of these languages are the same. (Arabic characters were borrowed for the Persian keyboard and extended.) And I’m wondering why these two sets of numbers exist when 7 of the 10 glyphs are exactly the same. (۰ ۱ ۲ ۳ ۷ ۸ ۹)
Considering the history of these languages and the fact that currently fewer of the [U+06F0 - U+06F9] digits are used in registered names, I think the easiest option would be to map
[U+06F0 - U+06F3, U+06F7 - U+06F9] to [U+0660 - U+0663, U+0667 - U+0669]
And permit mixing the 6 other characters (۴ ٤ ۵ ٥ ۶ ٦) as @raffy described.
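For clarity, here is what that mapping could look like in code. This is a minimal Python sketch of the proposal above, purely illustrative: the real ENS normalization is a separate library, and the function name here is made up.

```python
# Hypothetical sketch of the proposed digit mapping (illustrative only,
# not the actual ENS normalization implementation).

# Extended Arabic-Indic (Persian) digits 0-3 and 7-9 map to their
# Arabic-Indic counterparts; 4-6 are left alone because their glyphs differ.
EXT_TO_ARABIC = {
    chr(0x06F0 + i): chr(0x0660 + i)
    for i in list(range(0, 4)) + list(range(7, 10))
}

def map_digits(name: str) -> str:
    """Replace extended digits [0-3,7-9] with the Arabic-Indic set."""
    return "".join(EXT_TO_ARABIC.get(ch, ch) for ch in name)

# Extended ۱۲۳ maps to Arabic-Indic ١٢٣ ...
print(map_digits("\u06F1\u06F2\u06F3"))  # -> ١٢٣ (U+0661 U+0662 U+0663)
# ... while extended 4 (U+06F4) is preserved, keeping ١٢٤ and ١٢۴ distinct.
print(map_digits("\u06F1\u06F2\u06F4"))  # -> ١٢۴ (U+0661 U+0662 U+06F4)
```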
I was worried about name lookups (that some of the Persian digits would not be found even though they were entered during registration), but I guess the same normalization has to be done for every name lookup anyway, because of all the other mappings.
The sooner the new normalization is released, the less confusion there will be during mass adoption.
More details about Unicode for middle eastern languages can be found here:
Another interesting topic to consider for normalization is the use of diacritics in these languages, which can result in different permutations of each word. I recommend consulting with native speakers of each language for that.
@raffy thanks for the awesome work you are doing. I have no idea how you are able to wrap your head around all the possible languages and their glyphs!
I was always under the assumption that the Arabic-Indic digits were the valid, original ones, and the Extended Arabic-Indic ones were the invalid copies.
@raffy
I’m not exactly understanding the proposed outcome. What do you mean you can’t have a name fully composed of 0-3, for example? So you cannot have 003 in Arabic but you can have 004? And to have 123.eth, the non-extended version should be the original, and the extended version should have a yellow triangle, or be considered a duplicate, or not be allowed?
I think everyone was under the impression that the Arabic-Indic digits were the original and authentic, and the extended version is technically invalid and is a duplicate with a yellow caution sign?
١٢٣ [661 662 663] and ۱۲۳ [6F1 6F2 6F3] look the same (123 vs 123).
١٢٤ [661 662 664] and ۱۲۴ [6F1 6F2 6F4] look different (124 vs 124).
١٢٤ [661 662 664] and ۱۲٤ [6F1 6F2 664] look the same (124 vs 124).
If the name is pure digits, then it needs one digit from 4-6 (case 2) plus the non-extended/extended exclusion rule (where the second example of case 3 fails) to be valid.
If we want to support pure digit names with strictly 0-3,7-9 digits (case 1), we have to pick one. Otherwise, how do you justify case 1 and case 3 being two separate names?
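To make the three cases concrete, here is a toy Python check of that rule. This is purely illustrative (names are made up, and it is not the actual normalization code): a pure-digit label must not mix the two digit sets, and must contain at least one digit 4-6 to pin down the script.

```python
# Toy validity check for pure-digit Arabic/Persian labels (illustrative
# sketch of the rule described above, not the real normalization code).
ARABIC = {chr(0x0660 + i) for i in range(10)}    # ٠..٩ Arabic-Indic
EXTENDED = {chr(0x06F0 + i) for i in range(10)}  # ۰..۹ Extended (Persian)
FOUR_TO_SIX = {chr(0x0664), chr(0x0665), chr(0x0666),
               chr(0x06F4), chr(0x06F5), chr(0x06F6)}

def pure_digit_valid(label: str) -> bool:
    chars = set(label)
    if not chars or not chars <= (ARABIC | EXTENDED):
        return False  # not a pure Arabic/Persian digit label
    if chars & ARABIC and chars & EXTENDED:
        return False  # mixed sets: the second example of case 3
    return bool(chars & FOUR_TO_SIX)  # case 2: a 4-6 determines the script

print(pure_digit_valid("\u0661\u0662\u0664"))  # ١٢٤ -> True (has a 4)
print(pure_digit_valid("\u0661\u0662\u0663"))  # ١٢٣ -> False (case 1: no 4-6)
print(pure_digit_valid("\u06F1\u06F2\u0664"))  # ۱۲٤ -> False (mixed sets)
```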
Yeah it seems like for [0-3,7-9] the “Extended” digits should map to the regular digits. And then [4-6] can be allowed from both sets since they are distinct.
@474.ETH so if that mapping is used, then yes you can have 003 in Arabic, the valid name would use the non-extended digits: ٠٠٣.eth
Any use of the extended [0-3,7-9] digits would just normalize to the regular digits. So if someone enters the extended ۰۰۳.eth, it would normalize to the non-extended ٠٠٣.eth, and would resolve records for that non-extended name against the smart contracts.
Even if you did have the extended ۰۰۳.eth registered, it would have invalid metadata, would not be able to be listed on any sites that depend on metadata (OpenSea/etc), and people would not be able to send money to that name because their wallet would auto-normalize to the non-extended ٠٠٣.eth and send there instead.
It’s the same thing that happens with capital characters. You could manually register the capital GOD.eth against the smart contracts, but it is not going to be valid and will be essentially useless because all clients will normalize to god.eth first.
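A minimal sketch of that flow, assuming a toy registry; `str.lower` stands in for the full ENS normalization function, and the record value is made up:

```python
# Why GOD.eth is useless even if registered: clients normalize a name
# before resolving it, so the capitalized label is never looked up.
def normalize(name):
    return name.lower()  # placeholder for the real normalization function

def resolve(name, records):
    # Clients hash/resolve the *normalized* name, never the raw input.
    return records.get(normalize(name))

# Hypothetical registry: only the lowercase name has records.
records = {"god.eth": "0x1234...abcd"}

print(resolve("GOD.eth", records))  # -> 0x1234...abcd (resolved as god.eth)
print(resolve("god.eth", records))  # -> 0x1234...abcd (same record)
```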
Alright, I’m glad that we were able to find a solution for this one.
Btw, I’m a web dev and, as you might have guessed, my first language is Persian.
Feel free to dm me on twitter if you think I’d be able to help in some way.
Whole-script confusable means every non-overlapping subsequence has a confusable in another script. Note: this needs to be relaxed to: there exists another script that has confusables with every non-overlapping subsequence.
eg. apple (Latin) vs аррӏе (Cyrillic).
Confusable "x" means that there is a same-script confusable.
eg. a vs ɑ (both Latin).
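The Latin/Cyrillic pair above is easy to inspect by printing codepoints, e.g. with Python's stdlib `unicodedata` (note the confusable sets themselves come from Unicode's confusables.txt, which is not bundled with Python):

```python
# Show why "apple" (Latin) and its Cyrillic look-alike are whole-script
# confusable: every character pairs up with a look-alike in the other script.
import unicodedata

latin = "apple"
cyrillic = "\u0430\u0440\u0440\u04CF\u0435"  # renders as "аррӏе"

for l, c in zip(latin, cyrillic):
    print(f"{l} U+{ord(l):04X} {unicodedata.name(l):<28} ~ "
          f"{c} U+{ord(c):04X} {unicodedata.name(c)}")
```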
Modifications to the spec:
I made ASCII globally unconfusable.
Each confusable group is such that if Confuse(a,b) and Confuse(b,c) then Confuse(a,c).
For specific scripts, you can choose a default sequence, eg. Cyrillic has 3 e’s: еꬲҽ, I made е default.
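The two properties above (transitive groups, a chosen default per group) can be sketched with a small union-find. This is a toy illustration under my own assumptions, not the spec's actual data structure:

```python
# Toy sketch: confusable groups built with a union-find, so
# Confuse(a, b) and Confuse(b, c) imply Confuse(a, c) by construction,
# and each group can carry a chosen default character.
class ConfusableGroups:
    def __init__(self):
        self.parent = {}

    def find(self, ch):
        # Path-halving find; unseen characters start in their own group.
        self.parent.setdefault(ch, ch)
        while self.parent[ch] != ch:
            self.parent[ch] = self.parent[self.parent[ch]]
            ch = self.parent[ch]
        return ch

    def confuse(self, a, b):
        # Merge the groups containing a and b.
        self.parent[self.find(a)] = self.find(b)

g = ConfusableGroups()
e1, e2, e3 = "е", "ꬲ", "ҽ"  # the three e's from the example above
g.confuse(e1, e2)
g.confuse(e2, e3)
print(g.find(e1) == g.find(e3))  # -> True: transitivity by construction

# Choose a default per group, e.g. е:
defaults = {g.find(e1): e1}
print(defaults[g.find(e3)] == e1)  # -> True
```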
Current issues:
You can circumvent some confusables with combining marks by inserting additional marks. eg. Ac (A=letter, c=mark) can be broken with Abc (b=another mark).
Many characters need their confusables disabled. eg. ٠٠٠ (Arabic 000) doesn’t work at the moment because it confuses with ꓸ [A4F8], ١١١ (Arabic 111) confuses with a bunch of things. ٥۵ (5 and extended-5) confuse according to the spec.
Many confusables are missing. eg. ѐè isn’t in the Unicode database.
Overall, I think it’s doing the right thing in general, it’s just too strict.
Can we not disable single-script confusables across the board? Presumably if two different but similar characters exist in the same script, it’s for a reason?