ENS Name Normalization

As I said above, it’s a per-application UX issue.

I will include a beautifier function in the next release. Additionally, since there has been progress on the on-chain implementation, I will also add an on-chain beautifier function.



Can’t wait for it, beautiful! Highly appreciated by the Ethmoji99 and Ethmoji999 community.


FYI, in the currently deployed resolver tool, the “braille blank pattern” character (U+2800) shows up as valid: ENS Resolver


Good catch. I’ll scan through the valid set for additional invisibles.


I greatly simplified my ENSIP proposal. I think this is much closer to what we should standardize. It includes:

I need to merge in the Arabic digit mapping we decided on above, update the tests to include the handwritten ones (from the prior IDNA 2008 approach), and then double-check that everything agrees.

I’ll then port this logic to my ens-normalize.js repo, release a compressed implementation (as those 2 data files are 1.2MB combined), and spin off the remaining validation logic into a separate project.


We can follow this up with a matching on-chain implementation and then figure out what to do about validation: applying the more complex rules like single-script confusables, whole-script confusables, stupidly placed combining marks, bidi checks, etc.


Edit: For Unicode 15, it looks like 20 emoji and 1 ZWJ sequence:
1F6DC,1FA75,1FA76,1FA77,1FA87,1FA88,1FAAD,1FAAE,1FAAF,1FABB,1FABC,1FABD,1FABF,1FACE,1FACF,1FADA,1FADB,1FAE8,1FAF7,1FAF8

1F426 200D 2B1B → Black Bird
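
For anyone who wants to inspect these locally, here is a minimal sketch that turns the hex code points listed above into strings. The helper name `from_hex` is just an illustration, not part of any ENS library.

```python
# Code points copied from the list above (hex, without the "U+" prefix).
SINGLES = (
    "1F6DC,1FA75,1FA76,1FA77,1FA87,1FA88,1FAAD,1FAAE,1FAAF,1FABB,"
    "1FABC,1FABD,1FABF,1FACE,1FACF,1FADA,1FADB,1FAE8,1FAF7,1FAF8"
).split(",")
ZWJ_SEQUENCE = "1F426 200D 2B1B".split()


def from_hex(code_points):
    """Join a list of hex code points into a single string (hypothetical helper)."""
    return "".join(chr(int(cp, 16)) for cp in code_points)


new_emoji = [from_hex([cp]) for cp in SINGLES]
black_bird = from_hex(ZWJ_SEQUENCE)

print(len(new_emoji))                       # 20
print([f"{ord(c):X}" for c in black_bird])  # ['1F426', '200D', '2B1B']
```

Note that the ZWJ sequence is three code points long, which is why emoji have to be handled as sequences rather than character by character.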


I just had my first real-life run-in with this issue. A user was confused about why their name had multiple owners. The issue was that the name contained (from what I understand) a Persian 9 along with Arabic digits.


This is going to keep happening. I have talked about it on Twitter in the past and warned people.

Just for a single NNN name in Arabic/Persian there are 8 combinations

For a single NNNN name in Arabic/Persian there are 16 combinations
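
A quick way to check those counts: a minimal sketch that assumes each digit position can independently come from either the Arabic-Indic block (U+0660..U+0669) or the Extended Arabic-Indic block (U+06F0..U+06F9). The function name `digit_variants` is made up for illustration.

```python
from itertools import product

ARABIC_INDIC_ZERO = 0x0660           # block U+0660..U+0669
EXTENDED_ARABIC_INDIC_ZERO = 0x06F0  # block U+06F0..U+06F9


def digit_variants(ascii_digits):
    """Every spelling of the digits, picking either block for each position."""
    choices = [
        (
            chr(ARABIC_INDIC_ZERO + int(d)),
            chr(EXTENDED_ARABIC_INDIC_ZERO + int(d)),
        )
        for d in ascii_digits
    ]
    return ["".join(combo) for combo in product(*choices)]


print(len(digit_variants("123")))   # 8
print(len(digit_variants("1234")))  # 16
```

Each variant is a different string and therefore hashes to a different name. How many of them actually look identical depends on which digits are used, since not every digit has the same shape in both blocks.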

Non-ASCII characters on ENS are a mess; they were rushed out and approved without being fully thought through.

I think the mapping we discussed above is a fair solution.

It would be valuable to know if there are any other confusables that fit this pattern: where the end-user is frequently unaware of the difference due to script overlap (making a mapping the better solution.)

Note: "p" vs "р" [Latin 70 vs Cyrillic 440] doesn’t fit this pattern because those p's can be differentiated when surrounded by characters of the same script (that aren’t equally confusing.)

Got another confusable here:

Extended dashes trying to be hyphens


Yeah, that’s a good one. Both the em dash and the en dash should be mapped to "-". I’ll check if there are any more Common dash-like characters.

2013 2014 2212

Edit: minus


Does the mapping mean that if someone sent something to a name using an em/en dash by mistake, it would end up in the hyphen wallet, or would the transfer just not go through?

Mapping means all those hyphen-confusables get replaced with -.

In your example, sending something to “a—b.eth” would go to “a-b.eth”. But the larger point would be, “a—b.eth” isn’t valid.

Disallowing those characters is fine too, but hyphens are similar to the Arabic numerals in that they’re the same script, frequently used, but hard to visually distinguish.
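
As a rough illustration of what "mapping" means here, the sketch below replaces the three dash-like code points mentioned above (2013, 2014, 2212) with an ASCII hyphen. The table and function names are hypothetical, and the real mapping set may end up larger.

```python
# Hyphen-confusables from the discussion above, keyed by code point.
HYPHEN_LIKE = {
    0x2013: "-",  # EN DASH
    0x2014: "-",  # EM DASH
    0x2212: "-",  # MINUS SIGN
}


def map_dashes(name):
    """Replace each hyphen-confusable with an ASCII hyphen (illustrative only)."""
    return name.translate(HYPHEN_LIKE)


print(map_dashes("a\u2014b.eth"))  # a-b.eth
print(map_dashes("a-b.eth"))       # a-b.eth
```

Both inputs produce the same output, so both would hash to the same name.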


Ultimately, I see two separate steps: normalization and validation.

Normalization makes it so everyone, given the same input, hashes the correct name. During this process, each character is either: an emoji sequence, valid, mapped to something else, ignored, or disallowed. Changes to this logic can impact previously registered names.

Validation checks whether the name meets a set of criteria: it uses an accepted script combination, doesn’t have whole-script confusables, obeys bidirectional rules, doesn’t use characters in the wrong contexts, etc. Validation can only accept or reject a name; it can’t change the hash.

If we can standardize Normalization (and have an on-chain implementation), then input names will always resolve to the expected result, which eliminates the problems with emoji, invisibles (ZWJ, ZWNJ, Braille Blank, etc.), and hard-confusables (Arabic numbers, hyphens, etc.).
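
To make the two-step split concrete, here is a toy sketch. The tables are tiny placeholders I made up for illustration (nothing like the real data files), emoji sequences are left out entirely, and the single validation rule is a stand-in for the much more complex checks described above.

```python
import string


class DisallowedCharacter(ValueError):
    """Raised when normalization hits a character outside the toy tables."""


# Hypothetical toy tables; a real implementation would load these from data files.
VALID = set(string.ascii_lowercase + string.digits + "-.")
MAPPED = {c: c.lower() for c in string.ascii_uppercase}
MAPPED.update({"\u2013": "-", "\u2014": "-", "\u2212": "-"})
IGNORED = {"\u00ad"}  # SOFT HYPHEN, used here only as an example of "ignored"


def normalize(name):
    """Step 1: produce the string that gets hashed, or fail."""
    out = []
    for ch in name:
        if ch in VALID:
            out.append(ch)
        elif ch in MAPPED:
            out.append(MAPPED[ch])
        elif ch in IGNORED:
            continue
        else:
            raise DisallowedCharacter(f"U+{ord(ch):04X} is not allowed")
    return "".join(out)


def validate(normalized):
    """Step 2: accept or reject only. Placeholder rule: no empty labels."""
    return all(label != "" for label in normalized.split("."))


name = normalize("A\u2014B.eth")
print(name, validate(name))  # a-b.eth True
```

The design point is that `validate` returns a boolean and never touches the string, so tightening validation rules later cannot change which hash a name resolves to.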

Cheers for getting back to me :+1:

Would _ also be mapped to - ??

Underscore is disabled in IDNA 2003 but I was planning to enable it (along with $). I think mapping to hyphen is reasonable as well.

Are a-b and a_b too similar?

Definitely not.

Same physical key on a keyboard

I feel there will be many mistakes

I am biased but I feel it would be wrong to not map it to the hyphen name

I own several hyphen names and several hyphen pre-punk names

Think about companies like coca-cola, who own and use coca-cola.eth, then y-3.eth, g-starRAW.eth etc etc

All these companies will lose confidence in ENS if you then allow coca_cola.eth to be registered

They would all need to fight again to try and secure their names at cost

In my view, do not annoy these companies as they may turn their back against ENS

It would also mean all those pre-punk names with hyphens would be able to be copied

Actually not just pre-punk names but all names with a hyphen in them, and there are plenty

I know Nick doesn’t like hyphen names, but don’t shoot ENS in the foot. It’s already got its problems, so don’t make another one

ENS could be huge, it is already becoming a beast and I don’t think it’s even started yet, but again with mismanagement it could also fail

I have been openly critical of Nick and his ideas about how to issue 2- and 1-character names, and I still think the tax method he was suggesting is stupid. That is just my opinion, but if you don’t listen to opinions, it becomes an echo chamber and mistakes are made

regarding the $ sign

Does ENS really need it??

I’m not really seeing why it is needed

Remember that for mass adoption, ENS needs to appeal to the masses. The masses don’t spend all day at a computer keyboard. Simplicity will attract and keep them; make it too technical or let it have too many issues, and it will scare them away or make them hesitant to adopt it


Thank you, raffy, your work is appreciated!


Agree

Is this the outcome that was discussed and approved? I read through the most recent comments but am not exactly sure what the outcome was. Will the Extended Arabic-Indic (Persian) digits route to the regular Arabic-Indic digits where there is overlap?

I believe they need to be mapped, if we want a good UX.

According to UTS-46 w/ Context O, the recommendation for Arabic numerals is to never allow digit mixing. However, that still permits visually identical names for corresponding digits.

According to UAX-15 and visual inspection, the digits 0-3 and 7-9 are confusable, so you either disallow those characters or pick a preferred one. The recommended solution is to convert to punycode, so the user sees a gibberish name but now has the information necessary to differentiate the confusable characters (if they know the correct punycode form). For ENS, we don’t have an alternative input form.
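
To show why a no-mixing rule alone isn't enough, here is a small sketch. The mapping direction (Extended Arabic-Indic to Arabic-Indic) is an assumption for illustration, and `mixes_digit_blocks` is a hypothetical helper, not a function from any spec.

```python
# Digits whose shapes match across the two blocks, per the paragraph above.
CONFUSABLE_DIGITS = (0, 1, 2, 3, 7, 8, 9)

# Assumed direction: Extended Arabic-Indic (U+06F0..) -> Arabic-Indic (U+0660..).
DIGIT_MAP = {0x06F0 + d: chr(0x0660 + d) for d in CONFUSABLE_DIGITS}


def mixes_digit_blocks(label):
    """True if the label uses digits from both blocks (a 'digit mixing' check)."""
    has_arabic = any(0x0660 <= ord(c) <= 0x0669 for c in label)
    has_extended = any(0x06F0 <= ord(c) <= 0x06F9 for c in label)
    return has_arabic and has_extended


arabic = "\u0661\u0662\u0663"    # 123 in Arabic-Indic digits
extended = "\u06f1\u06f2\u06f3"  # 123 in Extended Arabic-Indic digits

print(mixes_digit_blocks(arabic), mixes_digit_blocks(extended))  # False False
print(arabic == extended)                                        # False
print(arabic == extended.translate(DIGIT_MAP))                   # True
```

Neither label mixes blocks, so both pass the mixing check while remaining different strings. Only the mapping collapses them to a single name.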

  • Names should normalize or fail → How do I resolve xyz.eth?
  • Names should be valid/accepted/not confusing or fail/warn. → Is xyz.eth a spoof?

Discussed? Yes. Approved? No, that’s the purpose of this discussion!


Hyphen and underscore are distinct characters with distinct visual appearances. Both are already used in DNS, too, where they have different meanings. We definitely shouldn’t map them to the same character.