ENS Name Normalization 2nd

Your opinion on confusables is skewed.

Ever since the underscore debate your opinions have been subpar.

No. Geresh is Hebrew. However, I’m not sure how to handle the RTL languages. I haven’t been able to get much feedback regarding CheckBidi or if a label could be mixed direction (mixed direction names are really weird to manipulate.) Hebrew could easily be script restricted (which makes direction irrelevant) – I’m just not sure if that’s acceptable.

I think we should revisit the non-RGI whitelisted emoji. I now think it’s a mistake to whitelist any non-RGI emoji.

Reference: UTS-51: Emoji Sets

  • Note: this suggests skin color modifiers should be allowed as singular, but doesn’t account for rendering/platform differences which cause the modifier to get absorbed into other characters.
  • Note: this also suggests that Emoji which aren’t Basic_Emoji aren’t RGI. This only includes digits (+FE0F) and single regionals, which are both disallowed.

Character Viewer (for desktop)

2 Likes

I didn’t realize any non-RGI were whitelisted, but yeah I agree that only RGI emojis should be allowed. If any become RGI in the future, then they can be added in future updates, such as when we update for a new Unicode version.

2 Likes

Hebrew and Arabic both read numbers left to right, so mixed-direction names are always possible. It breaks brains, and many programs. Check out the El Al logo: it reads the same RTL and LTR. Cool use case to ponder.

Where are we now with the hyphen rules? It’s hard to follow across all the versions. Is it still only the leading ones? They are getting more attention these days in the speculation world.

This goes really deep into a lot of things. I concur with zombiehacker. If there is a way to roll out languages in stages, after consulting with multiple academics for each one, that would be good. There’s a lot of crazy stuff in various languages, and the ENS should support them all. Language unification is one of the worst things that could happen to humanity, and the DNS took us quite a ways toward that point. Decentralization is really based on every language and culture having a proportionally equal voice.

2 Likes

Hyphens are allowed anywhere except when positions 2 and 3 are both hyphens (the pattern behind punycode’s xn--), if the label is ASCII. Reasonable hyphen-like characters are mapped to hyphen. Remaining dash-like characters are disallowed.

I have focused on Latin/Common scripts, punctuation, emoji, and symbol-like characters. Excluded and limited-use scripts have been restricted. The ASCII-confusable scripts (Greek/Cyrillic) can’t mix with other scripts or use whole-script confusables. Combining marks are restricted to at most one per base character, and many marks are disabled.

4 Likes

So to clarify, would -0-.eth be invalid, but -00-.eth be valid?

I forgot about the leading underscores; is that rule still leading-only?

After thinking about this for some weeks, I really would think everything through with the mappings/groupings. I can think of weird things, like how in Dutch ij is one letter, and the Dutch keyboard has it as one letter, but most Dutch people don’t want a Dutch keyboard because everyone speaks English. I remember learning Dutch for a year after I lost my phone in a taxi in Amsterdam. I replaced it with the same model, except its Windows CE build was locked to the Dutch language. It was a lot of money, so I just learned the Dutch UI instead of buying another phone. My Dutch friends laughed a lot. Microsoft didn’t get that localizing for Dutch meant putting things in English.

The context of that long story matters because even in that localization, ij wasn’t a letter on the primary keyboard; it was still an alternate, so I always just typed the two letters. This kind of scenario, where you can spell things both ways, can be infuriating when mapping languages to specific keysets or making word lists.

I write in Spanish often (poorly), but I never use accents or the ñ because I have an English keyboard. In fact, the English-speaking city where I grew up had a street name in Spanish with an ñ, and 30 years on it’s been replaced with just an n, and everyone says the street name wrong. Kind of like how everyone says Crimea wrong today in English (hint: it never rhymed with crime before the 2000s). A simple gap in history, and new history was made. I hope the same thing doesn’t happen to more obscure languages just because someone unfamiliar with the language makes a booboo. I’m not going to take any political stance here, but I do think perception has shifted based on the pronunciation of Crimea, and on writing Karabakh instead of Artsakh. Almost nobody means harm doing it, but ignorance got in there and massaged history for the eyes of outsiders.

Æ and Œ are both valid letters in English. Just because English changed to omit these letters most of the time does not make them less valid. It’s similar to church Russian or old Russian. Ij in Dutch has kind of gone the way of Æ and Œ, but in our lifetime, which is why it was already relegated on the keyboard circa 2006. Should Œ map to oe? Are they the same?

I know how to pronounce them, but I have no idea how to even argue for what to do if there were a mapping of just English letters. Who decides the English alphabet? Is it all of the English words that ever existed? Is it just what is currently taught in schools? What about dialects of English that have weird digraphs?

2 Likes

Both are valid but 00--.eth is invalid because it has -- at 2,3.

Yes. Allowed(__ab.eth) and Disallowed(a_b.eth)
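To make those two rules concrete, here is a rough sketch (illustrative only, not the library’s actual code; positions are 0-indexed):

```js
// Hyphen rule: an ASCII label is rejected when positions 2 and 3 are both
// hyphens (the pattern behind punycode's "xn--"). Hyphens are fine elsewhere.
function violatesHyphenRule(label) {
  const ascii = /^[\x00-\x7F]*$/.test(label);
  return ascii && label.length >= 4 && label[2] === '-' && label[3] === '-';
}

// Underscore rule: underscores may only appear as a leading run.
function violatesUnderscoreRule(label) {
  const firstOther = label.search(/[^_]/);
  return firstOther >= 0 && label.includes('_', firstOther);
}

violatesHyphenRule('-0-');       // false (valid)
violatesHyphenRule('-00-');      // false (valid)
violatesHyphenRule('00--');      // true  (invalid: -- at positions 2,3)
violatesUnderscoreRule('__ab');  // false (valid)
violatesUnderscoreRule('a_b');   // true  (invalid)
```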

There are a lot of digraphs/ligatures in Latin (50+ lowercase). In general, I think they’re all bad for ENS. I’ve kept a few literal ones ij/ff/fl/… (which are mapped to their component letters by IDNA) and æ (which stays as one character), but I disallowed the rest. ſt/st/ꝡ/ꜳ/… seem bad (mapped or valid.)
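For the mapped ones, plain NFKC already shows the component-letter behavior (the IDNA mapping for these ligatures coincides with it, and æ has no decomposition):

```js
// NFKC compatibility decomposition splits these ligatures into their
// component letters; æ is a distinct letter with no decomposition.
'ĳ'.normalize('NFKC');  // "ij"
'ﬀ'.normalize('NFKC');  // "ff"
'ﬂ'.normalize('NFKC');  // "fl"
'æ'.normalize('NFKC');  // "æ" (unchanged)
```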

I’m about to deploy an update with script-based combining-mark whitelists. This enables stuff like “with dot above” on specific Latin characters w/o enabling it on characters like “i” (which is pure insanity.) This also allows whitelisting multiple CMs at the character level (since the general rule allows at most one).

For your example, tilde is only allowed over a, e, n, and o. Tilde variations like middle, double-middle, vertical, below, overlay, etc. are all disabled.

For Greek/Cyrillic/Common, I’ve used the exemplars; for Latin, I’ve picked a subset of exemplars at my own discretion. For example, I’m pretty sure we don’t need all of these variations: ā a᷆ a᷇, but these are great: é ñ ç.

The Isolated characters I’d collected all turned out to be Common-script, so I’ve now retired the Isolated concept (characters that should never get CMs); I get that for free with this approach.
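To picture the whitelist, here is a toy structure (the shape and contents are made up for illustration, not the actual data files):

```js
// Toy illustration of a per-script combining-mark whitelist; anything not
// listed gets no combining marks at all.
const CM_WHITELIST = {
  Latin: {
    a: ['\u0303'],  // a + combining tilde => ã
    e: ['\u0301'],  // e + combining acute => é
    n: ['\u0303'],  // n + combining tilde => ñ
    o: ['\u0303'],  // o + combining tilde => õ
    // "i" intentionally absent: no combining marks allowed on it
  },
};

// True if `mark` may follow `base` for a label in `script`.
function markAllowed(script, base, mark) {
  const marks = (CM_WHITELIST[script] || {})[base];
  return Array.isArray(marks) && marks.includes(mark);
}

markAllowed('Latin', 'n', '\u0303'); // true  (ñ)
markAllowed('Latin', 'i', '\u0303'); // false (ĩ not whitelisted)
```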

2 Likes

With Latin cleaned up, many of the silly Greek and Cyrillic whole-script confusables disappeared. I also made an effort to remove all of the Greek look-alike characters from Latin (like ɑ/α, ɩ/ι, ẟ/δ, etc.). Similar to the ξ and π change (relaxing their script to Common), I think it would be reasonable to do the same to any of α(a), β(ß), γ(y), σ(o), ω(w), τ(t) (I put the nearest Latin confusable in parentheses.)

For example, the Greek whole-script confusables have simplified to a very reasonable set.

Greek Confusables
[
// 61 (a) LATIN SMALL LETTER A (Latn)
0x3B1, // 3B1 (α) GREEK SMALL LETTER ALPHA (maybe)
// 69 (i) LATIN SMALL LETTER I (Latn)
0x3B9, // 3B9 (ι) GREEK SMALL LETTER IOTA
// 6A (j) LATIN SMALL LETTER J (Latn)
0x3F3, // 3F3 (ϳ) GREEK LETTER YOT
// 6F (o) LATIN SMALL LETTER O (Latn)
0x3BF, // 3BF (ο) GREEK SMALL LETTER OMICRON
0x3C3, // 3C3 (σ) GREEK SMALL LETTER SIGMA (maybe)
// 70 (p) LATIN SMALL LETTER P (Latn)
0x3C1, // 3C1 (ρ) GREEK SMALL LETTER RHO
// 72 (r) LATIN SMALL LETTER R (Latn)
0x1D26, // 1D26 (ᴦ) GREEK LETTER SMALL CAPITAL GAMMA
// DF (ß) LATIN SMALL LETTER SHARP S (Latn)
0x3B2, // 3B2 (β) GREEK SMALL LETTER BETA
// 75 (u) LATIN SMALL LETTER U (Latn)
0x3C5, // 3C5 (υ) GREEK SMALL LETTER UPSILON
// 76 (v) LATIN SMALL LETTER V (Latn)
0x3BD, // 3BD (ν) GREEK SMALL LETTER NU
// 79 (y) LATIN SMALL LETTER Y (Latn)
0x3B3, // 3B3 (γ) GREEK SMALL LETTER GAMMA
]

The main issue with confusables is that I don’t know the ordering of all scripts. At the moment, I have 3 ordered scripts (Latin, Greek, Cyrillic), 133 scripts tagged as Restricted (which don’t interact with anything else), and 25 scripts that still need to be decided.

Why an ordering?

  • Latin "o" confuses with Greek "ο" — which one gets ooo.eth?
  • Greek "φ" confuses with Cyrillic "ф" — which one gets φφφ.eth?

Latin confuses with nothing, Greek confuses with Latin, Cyrillic confuses with Latin and Greek, and the Restricted scripts confuse with everything but themselves. At the moment, the unordered scripts don’t confuse with anything.
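Here’s a rough sketch of how that ordering resolves a contested label (toy data, just to illustrate the idea; not the actual implementation):

```js
// Earlier scripts outrank later ones; Restricted and still-unordered scripts
// rank last. A whole-script-confusable label loses if any script it collides
// with outranks its own script.
const ORDERED = ['Latin', 'Greek', 'Cyrillic'];
const rank = s => { const i = ORDERED.indexOf(s); return i < 0 ? Infinity : i; };

function losesContest(labelScript, collidesWith) {
  return collidesWith.some(other => rank(other) < rank(labelScript));
}

// Greek "οοο" collides with Latin "ooo": Latin outranks Greek, so Greek loses.
losesContest('Greek', ['Latin']);              // true
// Latin "ooo" collides with Greek/Cyrillic look-alikes, but Latin wins.
losesContest('Latin', ['Greek', 'Cyrillic']);  // false
```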

Scripts that Need to Be Ordered or Restricted
1 Armn Armenian
2 Arab Arabic
3 Thaa Thaana
4 Deva Devanagari
5 Beng Bengali
6 Guru Gurmukhi
7 Gujr Gujarati
8 Orya Oriya
9 Taml Tamil
10 Telu Telugu
11 Knda Kannada
12 Mlym Malayalam
13 Sinh Sinhala
14 Thai Thai
15 Laoo Lao
16 Tibt Tibetan
17 Mymr Myanmar
18 Geor Georgian
19 Hang Hangul
20 Ethi Ethiopic
21 Khmr Khmer
22 Hira Hiragana
23 Kana Katakana
24 Bopo Bopomofo
25 Hani Han

As I commented above, I made Hebrew restricted. I looked at all the registered names with Hebrew and they’re almost entirely single-script names. If we keep this classification, it would be possible to give Hebrew its own hyphen (Maqaf ־) and punctuation (Geresh ׳ and Gershayim ״). However, as restricted, it can’t mix with ASCII or use its own currency symbol (₪), although that could be fixed.


A question was asked about ß vs ss. In UTS-46, ß is a deviation character. Depending on which IDNA version is used, it either stays ß or gets mapped to ss. What’s really weird is that the capital (ẞ) isn’t a deviation and is hard-mapped to ss. Since ENSIP-1 allowed ZWJ (a deviation), I’ve assumed deviations are valid. Note: browsers seem to be a toss-up on this; Safari and Firefox keep it valid, while Brave and many online punycoders map it. For registrations, people use it as a "b" like ßinance and ßoobs, but it’s also used properly in weißbier and philippstöß.
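To illustrate the deviation behavior, here is a sketch of the two UTS-46 processing modes (not the normalizer’s actual code):

```js
// Deviation characters in UTS-46: ß (U+00DF), ς (U+03C2), ZWJ (U+200D),
// ZWNJ (U+200C). Transitional processing maps/removes them; non-transitional
// processing leaves them as-is (valid).
function mapDeviations(label, transitional) {
  if (!transitional) return label;        // non-transitional: left as-is
  return label
    .replace(/\u00DF/g, 'ss')             // ß  -> ss
    .replace(/\u03C2/g, '\u03C3')         // ς  -> σ
    .replace(/[\u200C\u200D]/g, '');      // ZWNJ/ZWJ removed
}

mapDeviations('weißbier', true);   // "weissbier"
mapDeviations('weißbier', false);  // "weißbier"
```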


For Common script, the main thing remaining is organizing the shapes, stars, crosses, and arrow symbols (there are zillions of them spread all over the place). These could probably remain as-is, but they’re effectively useless until they’re reduced to distinct, well-supported choices (like emoji).


I know some of the emoji enthusiasts have developed various lists, but it would be great to have a breakdown of emoji that render pixel-identical (:us: vs :us_outlying_islands:), nearly identical (:horse_racing:t2: vs :horse_racing:t3:), or mask-identical (:family_man_woman_boy: vs :family_man_woman_girl:) on various platforms. Given an emoji, it would be nice to know how closely you should inspect it.

4 Likes

I couldn’t approve more of the thought you are putting into this. I wish more people would be as excited about solving this immensely complex problem that will benefit everyone. I want to point to this thread as proof of why ENS > DNS. It’s not just about decentralizing or semi-decentralizing domain names; it’s about solving the problems DNS couldn’t and bringing the world together to avoid conflict, while at the same time preserving culture/language/intellectual diversity.

I know some of the emoji enthusiasts have developed various lists, but it would be great to have a breakdown of emoji that render pixel-identical (:us: vs :us_outlying_islands:), nearly identical (:horse_racing:t2: vs :horse_racing:t3:), or mask-identical (:family_man_woman_boy: vs :family_man_woman_girl:) on various platforms. Given an emoji, it would be nice to know how closely you should inspect it.

I have an idea for this for the scanner. Each emoji should have alt text. You give each emoji a numerical UID and you could also give it an English name for lookup on docs. So when a user hovers over it in a scanner they can see the UID. Of course the token ID works for devs, but a succinct number is best (which is why digits are so popular now in ENS).

As I commented above, I made Hebrew restricted. I looked at all the registered names with Hebrew and they’re almost entirely single-script names. If we keep this classification, it would be possible to give Hebrew its own hyphen (Maqaf ־) and punctuation (Geresh ׳ and Gershayim ״). However, as restricted, it can’t mix with ASCII or use its own currency symbol (₪), although that could be fixed.

Hebrew and Arabic are languages deeply tied into culture and religion (although Arabic predates Islam and is also important in other religions). They both have diacritical marks important to religious texts. Hebrew has pointed letters whose points are usually omitted, but that’s like omitting ë from Russian names. You can’t map ë to e. This is why everyone says Khruschev wrong: his name has a ë and it’s actually pronounced Hruschove, which sounds like shove, not shev. This applies to many names that are butchered needlessly in English through poor transliteration, stemming from a poor understanding of English dialects and pronunciations.

My grandfather’s name was Saul. In Yiddish and Hebrew it is שאול with diacritics and שָׁאוּל. The first letter is shin, which can be s or sh depending on where the normally omitted point on the letter is. His mother (who only spoke Yiddish) called him Saul like in Better Call Saul even though it’s actually pronounced Shaul in both Hebrew and Yiddish. Why? Because in the Russian Empire (now claimed by the state of Ukraine) the Yiddish speaking areas had to transliterate names into Russian (old Russian). Births in shtetls were registered in both Yiddish and Russian, and like Khruschev, Saul got mistransliterated to Саул, so people just gave up trying to say it right, much like politicization of the pronunciation of various terms in Russian/Ukrainian.

I had the same problem with my daughter. She was born in a Latino country, but is entitled to Ukrainian, Russian, and Israeli passports. She has a name with a sound that doesn’t exist in Russian. She was also given an English name in a Spanish-speaking country, and there were multiple transliterations into Cyrillic going on. Name normalization in the Cyrillic passports (Cyrillic being another word now pronounced wrong in English: it used to have a hard K, unlike Celtics, which Americans amusingly say historically correctly) meant that not only did she need a waiver to translate from the English transliteration, but the transliteration back to ASCII from Cyrillic (all passports need ASCII) needed a waiver as well, so it would match her birth certificate’s English name.

This was not a small deal; it took almost an hour to solve and wrap our heads around, because the computer systems were not made to solve this problem automatically. It took humans to say that the computer was wrong, and to override it. I provide these details as a warning about the real-world use of normalization and mapping: it can be ugly in some spots. I won’t get into the Americanization of Dutch surnames this time. That crime happened recently because of primitive databases. :slight_smile:

One more question on the hyphens. Is the trailing hyphen out now too? So _qwerty_.eth fails?

2 Likes

Your wording has changed on this

It used to be position 3 & 4

1234

Now it is 2 & 3

0123

Are you going to use 0123 from now on?

@Ronald

I’m guessing you mean trailing underscores? They are easy to confuse.

Underscores are only going through if they are leading

You can have multiple leading underscores

You can not have underscores in the middle of a name

You can not have underscores trailing

2 Likes

Sorry, it’s a programming thing (0- vs 1-based indexing). It’s always been the 3rd and 4th characters (think punycode, xn--a...).


I took the remaining unordered scripts and restricted most of them based on the purity of their registrations. The following scripts have considerable registrations, where the second row has nearly all pure registrations:

  1. Hang, Hani, Kana, Hira, Latn, Cyrl
  2. Arab, Deva, Hebr, Grek, Thai, Beng, Taml

I decided to make the 2nd row script-restricted as well. This greatly simplifies the remaining ordering problem, as Latn/Cyrl are pretty distinct from CJK and I can split them based on Hira/Kana vs Hang.

Although it’s probably hard to sample, I’d imagine some scripts would prefer their names to be pure and unmixable.


Based on registrations, I allow CJK to be romanized and allow access to unadorned Latin a-z.


There are a lot of dot characters, and at least 4 of them are “middle dots”:

0xB7, // (·) MIDDLE DOT (Common) 
0x387, // (·) GREEK ANO TELEIA (Common) => IDNA mapped to B7
0x30FB, // (・) KATAKANA MIDDLE DOT (Common)
0xFF65, // (・) HALFWIDTH KATAKANA MIDDLE DOT => IDNA mapped to 30FB

The ContextO rule (RFC 5892) says the middle dot is only allowed between two l’s, as in l·l (there are also dedicated characters for this: Ŀ and ŀ). Only 1 name in ENS uses this form, and it’s by accident (al·la·huak·bar).

There are 167 registrations with B7.
There are 49 registrations with 30FB

Although B7 has more registrations, the full-width (・) middle dot 30FB seems like the correct one (if any are allowed). Maybe it could use the same rules as the apostrophe? Most of the registrations that use it do so with pure Japanese, e.g. ペガサス・j・クロフォード. About half the registrations of B7 are with Latin or digits (as a prefix: ·555) and the other half are used correctly (豚林·vitalik).

I was intending to enforce the context rule, but I think it’s better to just disallow B7 (it’s not worth the code complexity.)
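For reference, the check I’m skipping would look roughly like this (just a sketch to show the extra complexity, not library code):

```js
// l·l context rule for U+00B7 (RFC 5892, CONTEXTO): the middle dot is valid
// only when the characters immediately before and after it are both "l".
function middleDotOk(label) {
  for (let i = 0; i < label.length; i++) {
    if (label[i] === '\u00B7' && (label[i - 1] !== 'l' || label[i + 1] !== 'l')) {
      return false;
    }
  }
  return true;
}

middleDotOk('l·l');   // true
middleDotOk('·555');  // false (nothing before the dot)
```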

6 Likes

I pushed a pretty large update. Resolver | Characters | Emoji

Preliminary error report:
ens_normalize (1.7.0) vs eth-ens-namehash (2.0.15) [1528752 labels] @ 2022-10-24T10:48:41.586Z (4MB)
(The error report code needs to be improved. It should probably be grouped by error type and split by script.)

I added a new feature ens_split(name): Label[] which basically produces a JSON structure of the information in the screenshot below:

JSON Description
[
  {
    input: [ 49, 65039, 8419, 82, 97, 771, 102, 102, 121, 128169 ],
    offset: 0,
    mapped: [ [ 49, 65039, 8419 ], 114, 97, 771, 102, 102, 121, [ 128169, 65039 ] ],
    output: [ 49, 8419, 114, 227, 102, 102, 121, 128169 ],
    emoji: true,
    script: 'Latin'
  },
  {
    input: [ 101, 116, 104 ],
    offset: 11,
    mapped: [ 101, 116, 104 ],
    output: [ 101, 116, 104 ],
    emoji: false,
    script: 'Latin'
  }
]
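Rough usage looks like this (the package name is taken from the public repo, so double-check the exact import there; the record fields are the ones shown above):

```js
import { ens_split } from '@adraffy/ens-normalize';

// Splits a name into labels and reports, per label, the input codepoints,
// the codepoints after mapping, the normalized output, whether the label
// contains emoji, and the script it was classified as.
const labels = ens_split('1️⃣Rãffy💩.eth');
for (const { output, emoji, script } of labels) {
  console.log(String.fromCodePoint(...output), { emoji, script });
}
```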

5 Likes

Hey guys, thanks for all the hard work you are doing.
I’m a dev at ens.vision and I have two questions.

We have the Persian 999 club on our website. After the normalization update, 973 of the 1000 three-digit names will be invalid and not resolvable. We already advised the Persian community not to renew their names. My question on their behalf is: will they receive a refund?

Also yesterday we listed new clubs of negative numbers after high demand from the community.
These also include negative Arabic numbers (Negative Arabic 99, Negative Arabic 999 and Negative Arabic 10k)
@raffy
Today we noticed that the latest update to the normalization code, which we are already using, makes these numbers invalid as well (Error: mixed-script Arabic confusable: "-"). Is there a possibility to allow leading and trailing hyphens for Arabic numbers? Because that’s how negative Arabic numbers are written. (Whether it should be leading or trailing is another topic.)

Thanks in advance :pray:

Only the core team would be able to speak authoritatively about refunds, so you’ll likely need to wait until they make some sort of announcement.

My guess would be that if the name was not able to be registered on the official ENS manager app/site in the first place, then no refund.

Alright, that would be great.
Persian numbers that will become invalid later can currently be registered.
Thanks.

My question is: was this the best outcome? Should we map all the extended digits for consistency, or disallow them all instead? From earlier discussion (oh, that was you, Octexor), we decided to map just the ones that were pixel-identical.

Yes, I applied this change. Please let me know if there are other issues.

Yes, that’s the best we can do. Any other method would confuse users, because most Persians don’t use the Arabic ٤٥٦ digits if they have a Persian keyboard.
We will update our Persian club with the new names soon. 34% of the numbers will be shared between Arabic and Persian, which is fine, just like the many words that are shared.

Nice, thank you!

There are some numbers that are still invalid.
The ones that only contain ١ or ٥.
١١١ ١١٥ ١٥١ ١٥٥ ٥١١ ٥١٥ ٥٥١ ٥٥٥
Same is true for Persian:
۱۱۱ ۱۱۵ ۱۵۱ ۱۵۵ ۵۱۱ ۵۱۵ ۵۵۱ ۵۵۵
The error is: whole script Arabic confusable
What is the reasoning behind it?

1 Like

The intention at present is to propose to the DAO to send refunds to anyone whose name was valid under the current normalisation scheme but not under the new one.

3 Likes

A whole-script confusable is where the entire label is composed of characters that can look like another label using a different script. Based on the ordering, Latin got priority: e.g. ٥٥٥ vs ooo, etc.
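As a toy sketch of that check, using the Greek list from earlier in the thread (simplified, not the actual implementation):

```js
// If *every* character of a Greek label has a Latin look-alike, the whole
// label could pass for a Latin name, and Latin (higher in the ordering) wins.
const GREEK_TO_LATIN = {
  '\u03B1': 'a', // α
  '\u03B9': 'i', // ι
  '\u03BF': 'o', // ο
  '\u03C1': 'p', // ρ
  '\u03C5': 'u', // υ
  '\u03BD': 'v', // ν
  '\u03B3': 'y', // γ
};

function wholeScriptConfusable(label) {
  return [...label].every(ch => ch in GREEK_TO_LATIN);
}

wholeScriptConfusable('οοο');  // true: looks like Latin "ooo"
wholeScriptConfusable('γατα'); // false: τ has no Latin look-alike here
```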

Although similar, I agree the following use different scales and baselines across various fonts and platforms, so I’ve removed them from the whole-script confusable list:

  • 661 (١) ARABIC-INDIC DIGIT ONE
  • 665 (٥) ARABIC-INDIC DIGIT FIVE
  • 6F5 (۵) EXTENDED ARABIC-INDIC DIGIT FIVE
  • 967 (१) DEVANAGARI DIGIT ONE
  • 966 (०) DEVANAGARI DIGIT ZERO

However, 966 and 665 are kinda pushing it. Input would be helpful.


Many of these are subjective. I’ve been using this list but I think many are too strict and there are a lot of things missing.


Another example is 4E00 (一) vs 30FC (ー). There are 4000 and 500 registrations respectively. I was thinking about disallowing 4E00 if the label is Japanese (contains a Kana/Hira character).
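If I went that route, the check would be something like this (just a sketch of the idea, nothing is implemented):

```js
// Flag U+4E00 (一, CJK "one") in labels that already contain Hiragana or
// Katakana, where U+30FC (ー) is almost certainly what was intended.
const cps = s => [...s].map(c => c.codePointAt(0));

function suspiciousIdeographOne(codepoints) {
  const isKana = cp =>
    (cp >= 0x3041 && cp <= 0x3096) ||  // Hiragana letters
    (cp >= 0x30A1 && cp <= 0x30FA);    // Katakana letters
  return codepoints.includes(0x4E00) && codepoints.some(isKana);
}

suspiciousIdeographOne(cps('ス一パ一'));  // true: 一 used where ー belongs
suspiciousIdeographOne(cps('スーパー'));  // false
suspiciousIdeographOne(cps('一三五'));    // false: pure Han, leave it alone
```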


Is there an easy way to allow 665 & 966 only if the other characters are of the same script?

Yes, 665 & 966 might be confusable, but the rest of the characters used would correct it.

I do know that it would then only produce a few names that could be confusable (matching numbers); whether that would be acceptable, I don’t know.