ENS Name Normalization

Hey, I recently registered some Indian Devanagari digit names. They resolve fine currently. However, if I check them in the resolver under development at adraffy.github.io/ens-normalize.js/test/resolver.html, they give mixed results: some digits normalize fine, while others are marked as confusable.

Here are the details for each digit:

9: 96F: Resolving
8: 96E: Not Resolving - Disallowed label “{96E}”: whole script confusing
7: 96D: Resolving
6: 96C: Resolving
5: 96B: Resolving
4: 96A: Not Resolving - Disallowed label “{96A}”: whole script confusing
3: 969: Not Resolving - Disallowed label “{969}”: whole script confusing
2: 968: Not Resolving - Disallowed label “{968}”: whole script confusing
1: 967: Not Resolving - Disallowed label “{967}”: whole script confusing
0: 966: Not Resolving - Disallowed label “{966}”: whole script confusing
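For reference, the codepoints above are contiguous: the ten Devanagari digits occupy U+0966 through U+096F, so digit-to-codepoint conversion is plain arithmetic. A small sketch (helper names are illustrative, not from any library):

```javascript
// The ten Devanagari digits occupy the contiguous block U+0966..U+096F,
// so digit <-> codepoint conversion is simple offset arithmetic.
const DEVANAGARI_ZERO = 0x0966;

function devanagariDigit(n) {
  if (!Number.isInteger(n) || n < 0 || n > 9) throw new RangeError('expected 0-9');
  return String.fromCodePoint(DEVANAGARI_ZERO + n);
}

function devanagariValue(ch) {
  const cp = ch.codePointAt(0);
  if (cp < DEVANAGARI_ZERO || cp > DEVANAGARI_ZERO + 9) {
    throw new RangeError('not a Devanagari digit');
  }
  return cp - DEVANAGARI_ZERO;
}
```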

Context: Devanagari is the dominant script in northern India, is used for Hindi and a number of other languages, and is by far the most widely used script in India overall. Notably, Indian banknotes have both the English and the Devanagari numerals printed on them.

Would really appreciate it if the team could provide a bit more clarity on how these confusables are planned to be handled?

1 Like

You can ignore those errors. The latest release of my library was far too strict. My current recommendation is that we standardize basically everything we normalize as of today (IDNA 2003) but only permit valid emoji (and ZWJ may appear only inside valid emoji sequences).

As you describe, there is a lot of locale-specific logic that isn’t encoded in the Unicode rules. Most applications don’t have this problem because names that fail bidi/confusable checks can just use punycode as the alternative form. For ENS, we just get one shot: either the name is valid or it isn’t. There isn’t (although there could be) an alternative UX for indicating that the name should be reviewed.

The context you provide about English mixing with Devanagari is helpful, as that would give credence for Highly Restrictive being too strict and/or that a Latin+Devanagari combo script should be included.

Here is a whole-script confusable for one of your examples:
0AEE ; 096E ; MA # ( ૮ → ८ ) GUJARATI DIGIT EIGHT → DEVANAGARI DIGIT EIGHT #
The claim would be that one or both of ૮.eth and ८.eth are invalid.
I defaulted to both in my latest release.
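The data line above follows the format of Unicode's confusables.txt ("source ; target ; type # comment"). A rough sketch of reading such a line into codepoints (`parseConfusableLine` is a hypothetical helper, not part of any library):

```javascript
// Hypothetical helper: parse one confusables.txt-style line,
// e.g. "0AEE ; 096E ; MA # ( ... )" -> { source, target } codepoints.
function parseConfusableLine(line) {
  const [src, dst] = line.split('#')[0].split(';').map(s => s.trim());
  return {
    source: parseInt(src, 16),                       // the confusable source
    target: dst.split(/\s+/).map(h => parseInt(h, 16)), // its skeleton target
  };
}

const entry = parseConfusableLine('0AEE ; 096E ; MA # ( \u0AEE \u2192 \u096E )');
// entry.source is Gujarati eight (U+0AEE); entry.target is [Devanagari eight (U+096E)]:
// two distinct codepoints that render nearly identically.
```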

2 Likes

Ah no, I can see how this is a tough nut to crack. My concern for my own digits brought me here, but the discussion is really interesting. I was reading through some of the alternative approaches you have been considering for the different issues that working with Unicode and scripts can cause, and loved the discussion so far. Will chip in where I can!

The context you provide about English mixing with Devanagari is helpful, as that would give credence for Highly Restrictive being too strict and/or that a Latin+Devanagari combo script should be included.

For sure. Mixing English with the local language is really common here in South Asia. Although in Pakistan the norm is to use a Romanized form of the local language online, so that's probably not an issue. But use of the local script online is also increasing in both Pakistan and India, so I can foresee it becoming an issue.

2 Likes

I reverted the demo to what I think should be standardized and computed another error report.

  • UTS-51 Emoji parsing happens first. It does all RGI sequences. It also includes the non-RGI whitelist. Any emoji that was mangled by IDNA 2003 is removed. In all cases FE0F is dropped. FE0E ends emoji parsing.
  • UTS-46 using IDNA 2003, STD3, deviations are valid. No CheckHyphens. No punycode. No combining mark restrictions. No ContextO. No CheckBidi. No script restrictions. No confusables. $ and _ are allowed. The alternative stops (3002 FF0E FF61) are disabled. ZWJ and ZWNJ outside of emoji are disabled.
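A heavily simplified sketch of the UTS-46 side of that per-character pass. The tables below are tiny stand-ins for the real data files, the emoji pass is omitted, and everything not listed is treated as valid (a simplification; the real data marks every codepoint explicitly):

```javascript
// Stand-in disposition tables (illustrative only).
const MAPPED = new Map([[0x41, [0x61]]]); // 'A' -> 'a' (case folding)
const IGNORED = new Set([0xAD]);          // SOFT HYPHEN is silently dropped
const DISALLOWED = new Set([0x2F]);       // '/' is rejected (STD3)

function normalizeChars(name) {
  const out = [];
  for (const ch of name) {
    const cp = ch.codePointAt(0);
    if (IGNORED.has(cp)) continue;        // ignored: removed from output
    if (DISALLOWED.has(cp)) {
      throw new Error(`disallowed codepoint: ${cp.toString(16)}`);
    }
    out.push(...(MAPPED.get(cp) ?? [cp])); // mapped output, or kept as-is
  }
  // Final pass: the merged output must be in Normalization Form C.
  return String.fromCodePoint(...out).normalize('NFC');
}
```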

I am updating my ENSIP to describe the process above.


I added a new feature to the demo which indicates when a block of text requires NFC. The following example is a mapped character, an ignored character, and a valid character that, when merged together (with the ignored character removed), get rearranged: C1 FE0F 325 → E1 325 → 1E01 301.
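The rearrangement can be reproduced with JavaScript's built-in normalizer: the ring below (U+0325, combining class 220) must sort before the acute (U+0301, class 230), and "a" plus ring below then compose into U+1E01:

```javascript
// á (U+00E1) followed by COMBINING RING BELOW (U+0325).
const merged = '\u00E1\u0325';

// NFC decomposes to a + 0301 + 0325, reorders the marks by combining class
// (0325 before 0301), then recomposes a + 0325 into U+1E01 (a with ring below).
// The leftover acute trails behind, so the marks visibly swap.
const nfc = merged.normalize('NFC'); // '\u1E01\u0301'
```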


This algorithm is simple enough that it can be implemented on-chain, and its development isn't blocked by the fine details of confusables.

@royalfork is developing a string → string implementation. I am currently pursuing the validation approach I described above: given a normalized name (via ens-normalize.js or an on-chain implementation), determine if it is valid (trusted, reachable, not spoofed, etc. – I'm not sure what the right terminology is.)

I deployed an EmojiParser and a BasicValidator (/^[a-z0-9_.-]+$/) that currently validates 94% of ENS names on-chain.
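For comparison, the equivalent off-chain check is a single regular expression (sketch; `isBasicName` is an illustrative name, not the contract's interface):

```javascript
// Pure ASCII names: lowercase letters, digits, underscore, period, hyphen.
// (The hyphen sits last in the character class so it is literal, not a range.)
const BASIC = /^[a-z0-9_.-]+$/;

function isBasicName(name) {
  return BASIC.test(name);
}
```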

The next contract I am making is NFC quick check (uint24 → bool) and/or full NFC (uint24[] → uint24[]). With that, it should be pretty straightforward to write multiple single-script validation contracts to increase the coverage from 94% to 99%.
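Off-chain, the question the quick-check contract answers can be sketched as a no-op test against the built-in normalizer (the on-chain version would consult a per-codepoint table instead of calling a normalizer):

```javascript
// A string passes the check if normalizing it to NFC changes nothing.
function isNFC(s) {
  return s === s.normalize('NFC');
}
```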


In terms of what is and isn’t valid, I’m still not sure. Because emoji parsing is independent of text processing (emoji can be mixed with any script, etc.), validation only needs to care about non-emoji confusables and other exotic Unicode rules. By using separate validators, we simply need to chip away at the problem.

Here are the script combinations of registered names sorted by frequency:

We could probably validate 95%+ of each script using just exemplars and double-checking for confusables.

For example, making an on-chain validator for those 70 pure Thai names is probably trivial. There are only 86 characters in the Thai script and the exemplars are even smaller. If we can agree which of those characters are confusable, the corresponding validator is only a couple lines of code.
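As a sketch of how small such a validator is, here is the off-chain equivalent using the Unicode script property (illustrative only; the agreed-upon confusable exclusions are omitted):

```javascript
// Accept only codepoints carrying the Thai script property.
// A real validator would additionally exclude the agreed-confusable characters.
const PURE_THAI = /^\p{Script=Thai}+$/u;

function isPureThai(label) {
  return PURE_THAI.test(label);
}
```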

2 Likes

It’s a neat idea, but I’m not generally a big fan of making people pay gas to compute something onchain that’s mostly only used offchain. Any function that’s used to read this checkmark status could instead be written to calculate it at read time.

Any updates on this?

As I said above, it’s a per-application UX issue.

I will include a beautifier function in the next release. Additionally, since there has been progress on the on-chain implementation, I will also add an on-chain beautifier function.


3 Likes

Can’t wait for it, beautiful! Highly appreciated by the Ethmoji99 and Ethmoji999 community.

1 Like

FYI, in the currently deployed resolver tool, the “Braille Pattern Blank” character (U+2800) shows up as valid: ENS Resolver

1 Like

Good catch. I’ll scan through the valid set for additional invisibles.


I greatly simplified my ENSIP proposal. I think this is much closer to what we should standardize. It includes:

I need to merge in the Arabic digit mapping we decided on above, update the tests to include the handwritten ones (from the prior IDNA 2008 approach), and then double-check that everything agrees.

I’ll then port this logic to my ens-normalize.js repo and release a compressed implementation (as those 2 data files are 1.2MB combined) and spinoff the remaining validation logic into a separate project.


We can follow this up with a matching on-chain implementation and then figure out what to do about validation: applying the more complex rules like single-script confusables, whole-script confusables, stupidly placed combining marks, check bidi, etc.


Edit: For Unicode 15, it looks like 20 emoji and 1 ZWJ sequence:
1F6DC,1FA75,1FA76,1FA77,1FA87,1FA88,1FAAD,1FAAE,1FAAF,1FABB,1FABC,1FABD,1FABF,1FACE,1FACF,1FADA,1FADB,1FAE8,1FAF7,1FAF8

1F426 200D 2B1B → Black Bird

2 Likes

I just had my first real-life run-in with this issue. A user was confused why their name had multiple owners. The issue was that the name contained (from what I understand) a Persian 9 along with Arabic digits.

1 Like

This is going to keep happening. I have talked about it in the past on Twitter and warned people.

Just for a single NNN name in Arabic/Persian there are 8 combinations

For a single NNNN name in Arabic/Persian there are 16 combinations

Non-ASCII characters on ENS are a mess; they were rushed out and approved without being fully thought out.

I think the mapping we discussed above is a fair solution.

It would be valuable to know if there are any other confusables that fit this pattern: where the end-user is frequently unaware of the difference due to script overlap (making a mapping the better solution.)

Note: "p" vs "р" [Latin 70 vs Cyrillic 440] doesn’t fit this pattern because those p's can be differentiated when surrounded by characters of the same script (that aren’t equally confusing.)

Got another confusable here:

Extended dashes trying to be hyphens

1 Like

Yeah, that’s a good one; both the em dash and the en dash should be mapped to "-". I’ll check if there are any more Common dash-like characters.

2013 2014 2212

Edit: minus

1 Like

Does the mapping mean that if someone sent something to a name using an em/en dash by mistake, it would end up in the hyphen wallet,

or would the transfer just not go through?

Mapping means all those hyphen-confusables get replaced with -.

In your example, sending something to “a—b.eth” would go to “a-b.eth”. But the larger point would be, “a—b.eth” isn’t valid.

Disallowing those characters is fine too, but hyphens are similar to the Arabic numerals in that they’re the same script, frequently used, but hard to visually distinguish.
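The mapping itself is tiny. As a sketch, using the codepoints listed earlier in the thread:

```javascript
// EN DASH (U+2013), EM DASH (U+2014), MINUS SIGN (U+2212) -> ASCII hyphen.
const DASHES = /[\u2013\u2014\u2212]/g;

function foldDashes(label) {
  return label.replace(DASHES, '-');
}
```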


Ultimately, I see two separate steps: normalization and validation.

Normalization makes it so that everyone, given an input, hashes the correct name. During this process, each character is either: part of an emoji sequence, valid, mapped to something else, ignored, or disallowed. Changes to this logic can impact previously registered names.

Validation checks if the name meets a bunch of criteria, like it’s an accepted script combination, doesn’t have whole script confusables, obeys bidirectional rules, doesn’t use characters in wrong contexts, etc. Validation can only reject or accept a name, it can’t change the hash.

If we can standardize Normalization (and have an on-chain implementation), then input names will always resolve to the expected result, which eliminates the problems with emoji, invisibles (ZWJ, ZWNJ, Braille Blank, etc.), and hard-confusables (Arabic numbers, hyphens, etc.).
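The two-step split can be written down as function shapes (purely illustrative; the real normalize/validate implementations are the ones discussed above):

```javascript
// normalize() may rewrite its input (map/ignore) or throw (disallowed);
// it determines the hash. validate() may only accept or reject; it can
// never change the hash.
function process(name, normalize, validate) {
  const normalized = normalize(name);
  if (!validate(normalized)) {
    throw new Error('normalized but failed validation');
  }
  return normalized;
}
```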

Cheers for getting back to me :+1:

Would _ also be mapped to - ??

Underscore is disabled in IDNA 2003 but I was planning to enable it (along with $). I think mapping to hyphen is reasonable as well.

Are a-b and a_b too similar?

Definitely not.