ENS Name Normalization

I haven’t been able to establish an algorithm worth standardizing that incorporates all of the validation checks, prevents all known spoofs, and has a reasonable on-chain implementation.

My plan is to revert to my January release which fixes the Emoji and ZWJ issue and establishes the mechanical process of preparing a name for hashing. I will update my ENSIP to match. This should be done soon.

I also think it’s wise to use IDNA 2003 + whatever characters we discussed (underscore, +missing keycaps, +missing emoji, currency symbols) and remove ContextO, CheckBidi, CheckHyphens, etc. If there are any other characters worth enabling, please let me know.

A separate library (that won’t be part of the ENSIP) will be provided to apply validation checks. However, I think on-chain solutions might be better.


For an on-chain implementation, I think the first step is writing a function that asserts whether a name was normalized correctly: string → bool

I would consider names that match /^[a-z0-9_.-]+$/ w/Emoji as the “safest” set of ENS names (ignoring skin-tone emoji differences). 94% of registered names fit these criteria.

My thinking is that emoji parsing is independent of text parsing. We just need a function that takes a string abcXzYdef (where “abc” and “def” are text and XzY is an emoji ZWJ sequence) and produces a new string abcEEEdef, where E acts like a generic emoji placeholder. Then a second pass is made using a text filter.

UTF8.decode(string) -> uint24[] // revert if invalid
Emoji.filter(uint24[]) -> uint24[] // revert if middle of sequence
BasicValidator.validate(uint24[]) // revert if invalid codepoint
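Off-chain, the same two-pass pipeline can be sketched in a few lines of TypeScript. This is only an illustration: the two-entry emoji table and the placeholder codepoint are stand-ins for the real UTS-51 sequence data, not the actual contract logic.

```typescript
// Toy emoji whitelist: each entry is a complete emoji (ZWJ) sequence.
// The real filter would walk the full RGI emoji sequence data.
const EMOJI: string[] = ["\u{1F4A9}", "\u{1F3F3}\u{200D}\u{1F308}"]; // 💩, 🏳‍🌈

// Arbitrary private-use codepoint marking "an emoji sequence was here".
const PLACEHOLDER = 0xe000;

// Pass 1: replace each whitelisted emoji sequence with a single placeholder.
function filterEmoji(name: string): number[] {
  const out: number[] = [];
  let i = 0;
  outer: while (i < name.length) {
    for (const e of EMOJI) {
      if (name.startsWith(e, i)) {
        out.push(PLACEHOLDER);
        i += e.length;
        continue outer;
      }
    }
    const cp = name.codePointAt(i)!;
    out.push(cp);
    i += cp > 0xffff ? 2 : 1;
  }
  return out;
}

// Pass 2: validate the remaining text codepoints against the "safest" set.
function validateBasic(cps: number[]): boolean {
  return cps.every(
    (cp) =>
      cp === PLACEHOLDER ||
      (cp >= 0x61 && cp <= 0x7a) || // a-z
      (cp >= 0x30 && cp <= 0x39) || // 0-9
      cp === 0x5f || cp === 0x2d || cp === 0x2e // _ - .
  );
}
```

With this split, `validateBasic(filterEmoji("💩raffy.eth"))` passes, while a name with an unnormalized uppercase letter fails the text pass.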

I’ve written two versions of the emoji filter contract. One works as a library and uses a state machine; it’s about 160K gas to validate 💩raffy🏳‍🌈.eth. The other is a contract that uses storage, and is about 60K gas.

2 Likes

Brainstorming: This might be stupid but what if we assert and store on-chain that a name is valid? The equivalent of a “blue” checkmark?

Imagine a master contract, with the following pseudocode:

interface IValidator {
    function validate(uint24[] cps);
}

mapping (uint256 => address) validated;
mapping (address => uint256) ranks;
address[] validators; 
address emoji; 

function setEmojiFilter(address) onlyDAO; // upgradable emoji 
function setValidator(address, uint256 rank) onlyDAO; // set 0 to invalidate

function hasEmoji(string name) returns (bool);
function getValidatorRank(address) returns (uint256);
function getNameRank(string name) returns (uint256);
function getNodeRank(uint256 node) returns (uint256);
function getBatchRank(string[] names) returns (uint256[]);

// return true if name is validated by validator
// return false if name is already validated by a better validator
// revert if name is invalid
function validate(address validator, string name, bool hasEmoji) returns (bool) {
    uint256 rank = getValidatorRank(validator);    
    require(rank > 0, "not a validator");
    uint256 node = namehash(name);
    address validator0 = validated[node];
    if (validator0 != address(0x0)) { // prior validation exists
        uint256 rank0 = getValidatorRank(validator0);
        if (rank0 > 0 && rank0 <= rank) return false; // prior rank is better
    }
    uint24[] cps = UTF8.decode(name);
    if (hasEmoji) cps = Emoji(emoji).filter(cps); 
    IValidator(validator).validate(cps);
    validated[node] = validator;
    return true; // valid
}

function willValidate(address validator, string name, bool hasEmoji) view {
    uint256 rank = getValidatorRank(validator);    
    require(rank > 0, "not a validator");
    uint24[] cps = UTF8.decode(name);
    if (hasEmoji) cps = Emoji(emoji).filter(cps); 
    IValidator(validator).validate(cps);
}

To check if a name is valid: getNameRank("raffy.eth") > 0

To validate a name (this could be one-button on a website or part of the registration/renew process):

  • You normalize your name.
  • You eth_call hasEmoji to see if your name has emoji.
  • You eth_call validators() to get a list of validator contracts.
  • You loop through the validators and find the most efficient/best-ranked one that willValidate().
  • You submit a tx for validate() with that contract, if your name has emoji, and the name. On success, it associates the node of the name with the address of contract that validated it.
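Sketching that client-side flow with plain objects standing in for the contracts (the `Validator` shape and `willValidate` here are hypothetical stand-ins for the eth_call round-trips; lower rank = safer, as in the pseudocode above):

```typescript
interface Validator {
  addr: string;
  rank: number; // lower = safer, per the master contract's ranking
  // Mirrors the view function: true if validate() would succeed for this name.
  willValidate(name: string): boolean;
}

// Pick the best-ranked validator that accepts the name, or null if none do.
function pickValidator(name: string, validators: Validator[]): Validator | null {
  const ok = validators.filter((v) => v.rank > 0 && v.willValidate(name));
  if (ok.length === 0) return null;
  return ok.reduce((best, v) => (v.rank < best.rank ? v : best));
}
```

The client would then submit the validate() transaction using the picked validator's address.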

Validations never expire. However, the underlying validator can get revoked by the master contract. Anyone can validate any name. Validator ranking can be used for filtering/trust and preventing someone from “downgrading” a strongly-validated name to a more complex validator.

Validator ranking can also be used client-side: simply error if the name isn’t validated or if it has a rank above some threshold. Ranks could be associated with a color.

  0 = Invalid
100 = Basic /^[a-z0-9_-]+$/
200 = Valid ASCII
200 = Just Arabic Digits
300 = Latin (single-script)
300 = Greek (single-script, no overlapping confusables)

Validators only need to implement codepoint validation on non-emoji output characters.
Validators only need to validate a subset of the valid name space.

A separate contract will be needed to efficiently perform NFC QuickCheck (to ensure a name is in NFC) for complex charsets.
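Off-chain, a stand-in for that check is trivial, because the host runtime ships a Unicode normalizer. The on-chain contract would implement the actual UAX #15 Quick_Check algorithm over `NFC_QC` property data; this sketch just leans on the runtime instead:

```typescript
// Off-chain stand-in for the NFC check. The on-chain version would implement
// UAX #15 Quick_Check over codepoint arrays; here the runtime's ICU does the work.
function isNFC(name: string): boolean {
  return name === name.normalize("NFC");
}
```

For example, composed é (U+00E9) passes, while decomposed e + U+0301 does not.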


The DAO could vote to approve new validation contracts. Each language could have a separate/multiple validator contracts. You could even have a non-algorithmic validator contract that simply has an append-only list of approved names. The ranks of each validator can be adjusted by the DAO if a problem is found.

If the ENS registrar was provided a hint of what validator to use (possibly through the commitment), it could call validate during registration.

2 Likes

Hey, I recently registered some Indian devanagari digit names. They resolve fine currently. However, if I check them in the resolver under development at adraffy.github.io/ens-normalize.js/test/resolver.html, they give mixed results. Some digits normalize fine, while others are marked as confusable.

Here are the details for each digit:

9: 96F: Resolving
8: 96E: Not Resolving - Disallowed label “{96E}”: whole script confusing
7: 96D: Resolving
6: 96C: Resolving
5: 96B: Resolving
4: 96A: Not Resolving - Disallowed label “{96A}”: whole script confusing
3: 969: Not Resolving - Disallowed label “{969}”: whole script confusing
2: 968: Not Resolving - Disallowed label “{968}”: whole script confusing
1: 967: Not Resolving - Disallowed label “{967}”: whole script confusing
0: 966: Not Resolving - Disallowed label “{966}”: whole script confusing

Context: Devanagari is the dominant script in northern India, is used for Hindi and a number of other languages, and is by far the most widely used script in India overall. Notably, Indian banknotes have both the English and the Devanagari numerals printed on them.

Would really appreciate it if the team could provide a bit more clarity on how these confusables are planned to be handled?

1 Like

You can ignore those errors. The latest release of my library was far too strict. My current recommendation is that we standardize basically everything we normalize as of today (IDNA 2003) but only permit valid emoji (and ZWJ may only appear inside valid emoji sequences).

As you describe, there is a lot of locale-specific logic that isn’t encoded in the Unicode rules. Most applications don’t have this problem because names that fail bidi/confusable checks can just use punycode as the alternative form. For ENS, we just get one shot: either the name is valid or it isn’t. There isn’t (although there could be) an alternative UX for indicating that the name should be reviewed.

The context you provide about English mixing with Devanagari is helpful, as that would give credence for Highly Restrictive being too strict and/or that a Latin+Devanagari combo script should be included.

Here is a whole-script confusable for one of your examples:
0AEE ; 096E ; MA # ( ૮ → ८ ) GUJARATI DIGIT EIGHT → DEVANAGARI DIGIT EIGHT #
The claim would be that one or both of ૮.eth and ८.eth are invalid.
I defaulted to both in my latest release.

2 Likes

Ah no. I can see how this is a tough nut to crack. My concern for my own digits brought me here, but the discussion is really interesting. Was reading through some of the alternative approaches you guys have been considering for the different issues that Unicode and scripts can cause, and loved the discussion so far. Will chip in where I can!

The context you provide about English mixing with Devanagari is helpful, as that would give credence for Highly Restrictive being too strict and/or that a Latin+Devanagari combo script should be included.

For sure. Mixing English with the local language is really, really common here in South Asia. Although in Pakistan the norm is to use a Romanized form of the local language online, so probably not an issue. But use of the local script online is also increasing in both Pakistan and India, so I can foresee that becoming an issue.

2 Likes

I reverted the demo to what I think should be standardized and computed another error report.

  • UTS-51 Emoji parsing happens first. It does all RGI sequences. It also includes the non-RGI whitelist. Any emoji that was mangled by IDNA 2003 is removed. In all cases FE0F is dropped. FE0E ends emoji parsing.
  • UTS-46 using IDNA 2003, STD3, deviations are valid. No CheckHyphens. No punycode. No combining mark restrictions. No ContextO. No CheckBidi. No script restrictions. No confusables. $ and _ are allowed. The alternative stops (3002 FF0E FF61) are disabled. ZWJ and ZWNJ outside of emoji are disabled.

I am updating my ENSIP to describe the process above.


I added a new feature to the demo which indicates that a block of text requires NFC. The following example is a mapped character, an ignored character, and valid characters that, when merged together (where the ignored character is removed), get rearranged: C1 FE0F 325 326 → 1E01 326 301.
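This rearrangement can be reproduced with the runtime's Unicode normalizer, starting from the already-mapped form: á (U+00E1) followed by the combining marks U+0325 (ring below) and U+0326 (comma below):

```typescript
// á + ring below + comma below: NFC decomposes á into a + U+0301,
// reorders the marks by combining class (the below marks sort before the
// acute), then recomposes a + U+0325 into U+1E01 (a with ring below).
const merged = "\u00e1\u0325\u0326";
const nfc = merged.normalize("NFC");
const cps = [...nfc].map((c) => c.codePointAt(0)!.toString(16).toUpperCase());
// cps is ["1E01", "326", "301"]
```

The acute accent, which was the second codepoint of the input, ends up last — which is why a naive per-character merge without a final NFC pass produces the wrong hash.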


This algorithm is simple enough that it can be implemented on-chain, and its development isn’t blocked by the fine details of confusables.

@royalfork is developing a string → string implementation. I am currently pursuing the validation approach I described above: given a normalized name (via ens-normalize.js or an on-chain implementation), determine if it is valid (trusted, reachable, not spoofed, etc. – I’m not sure what the right terminology is).

I deployed an EmojiParser and BasicValidator (/^[a-z0-9_.-]+$/) that currently validates 94% of ENS names on-chain.

The next contract I am making is NFC quick check (uint24[] → bool) and/or NFC (uint24[] → uint24[]). With that, it should be pretty straightforward to write multiple single-script validation contracts to increase the coverage from 94% to 99%.


In terms of what is and isn’t valid, I’m still not sure. Because emoji parsing is independent of text processing (emoji can be mixed with any script, etc.), validation only needs to care about non-emoji confusables and other exotic Unicode rules. By using separate validators, we simply need to chip away at the problem.

Here are the script combinations of registered names sorted by frequency:
[image: table of script combinations by frequency]

We could probably validate 95%+ of each script using just exemplars and double-checking for confusables.

For example, making an on-chain validator for those 70 pure Thai names is probably trivial. There are only 86 characters in the Thai script and the exemplars are even smaller. If we can agree which of those characters are confusable, the corresponding validator is only a couple lines of code.
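A sketch of what such a single-script validator amounts to. The Thai block boundaries below are real (U+0E01 KO KAI through U+0E5B KHOMUT), but treating the whole block as allowed is a simplifying assumption; a real validator would also exclude any characters agreed to be confusable:

```typescript
// Single-script validator sketch: accept only codepoints from one script's
// block. A real validator would subtract the agreed confusable characters.
function validateThai(cps: number[]): boolean {
  return cps.length > 0 && cps.every((cp) => cp >= 0x0e01 && cp <= 0x0e5b);
}
```

As the post says, once the allowed set is agreed, the whole validator really is a couple of lines.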

2 Likes

It’s a neat idea, but I’m not generally a big fan of making people pay gas to compute something onchain that’s mostly only used offchain. Any function that’s used to read this checkmark status could instead be written to calculate it at read time.

Any updates on this?

As I said above, it’s a per-application UX issue.

I will include a beautifier function in the next release. Additionally, since there has been progress on the on-chain implementation, I will also add an on-chain beautifier function.


3 Likes

Can’t wait for it, beautiful! Highly appreciated by the Ethmoji99 and Ethmoji999 communities.

1 Like

FYI in the current deployed resolver tool, the “Braille Pattern Blank” character (U+2800) shows up as valid: ENS Resolver

1 Like

Good catch. I’ll scan through the valid set for additional invisibles.


I greatly simplified my ENSIP proposal. I think this is much closer to what we should standardize. It includes:

I need to merge in the Arabic digit mapping we decided on above, update the tests to include the handwritten ones (from the prior IDNA 2008 approach), and then double-check that everything agrees.

I’ll then port this logic to my ens-normalize.js repo and release a compressed implementation (as those 2 data files are 1.2MB combined) and spinoff the remaining validation logic into a separate project.


We can follow this up with a matching on-chain implementation and then figure out what to do about validation: applying the more complex rules like single-script confusables, whole-script confusables, stupidly placed combining marks, check bidi, etc.


Edit: For Unicode 15, it looks like 20 emoji and 1 ZWJ sequence:
1F6DC,1FA75,1FA76,1FA77,1FA87,1FA88,1FAAD,1FAAE,1FAAF,1FABB,1FABC,1FABD,1FABF,1FACE,1FACF,1FADA,1FADB,1FAE8,1FAF7,1FAF8

1F426 200D 2B1B → Black Bird

2 Likes

I just had my first real-life run-in with this issue. A user was confused about why their name had multiple owners. The issue was that the name contained (from what I understand) a Persian 9 along with Arabic digits.

1 Like

This is going to keep happening. I have talked about it in the past on Twitter and warned people.

Just for a single NNN name in Arabic/Persian there are 8 combinations

For a single NNNN name in Arabic/Persian there are 16 combinations

Non-ASCII characters on ENS are a mess; they were rushed out and approved without being fully thought out.

I think the mapping we discussed above is a fair solution.

It would be valuable to know if there are any other confusables that fit this pattern: where the end-user is frequently unaware of the difference due to script overlap (making a mapping the better solution).

Note: "p" vs "р" [Latin 70 vs Cyrillic 440] doesn’t fit this pattern, because those p's can be differentiated when surrounded by characters of the same script (that aren’t equally confusing).

Got another confusable here:

Extended dashes trying to be hyphens

1 Like

Yeah, that’s a good one; both the em dash and the en dash should be mapped to "-". I’ll check if there are any more Common dash-like characters.

2013 (en dash) 2014 (em dash) 2212 (minus sign)

Edit: added 2212, the minus sign

1 Like

Does the mapping mean that if someone sent something to a name using an em/en dash by mistake, it would end up in the hyphen wallet

or would the transfer just not go through?

Mapping means all those hyphen-confusables get replaced with -.

In your example, sending something to “a—b.eth” would go to “a-b.eth”. But the larger point would be, “a—b.eth” isn’t valid.

Disallowing those characters is fine too, but hyphens are similar to the Arabic numerals in that they’re the same script, frequently used, but hard to visually distinguish.


Ultimately, I see two separate steps: normalization and validation.

Normalization makes it so that everyone, given an input, hashes the correct name. During this process, each character is either: part of an emoji sequence, valid, mapped to something else, ignored, or disallowed. Changes to this logic can impact previously registered names.

Validation checks if the name meets a bunch of criteria, like it’s an accepted script combination, doesn’t have whole script confusables, obeys bidirectional rules, doesn’t use characters in wrong contexts, etc. Validation can only reject or accept a name, it can’t change the hash.

If we can standardize Normalization (and have an on-chain implementation), then input names will always resolve to the expected result, which eliminates the problems with emoji, invisibles (ZWJ, ZWNJ, Braille Blank, etc.), and hard-confusables (Arabic numbers, hyphens, etc.).
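The two-step split in miniature. The tiny tables here are illustrative, covering just the hyphen-confusables discussed above plus an ignored FE0F; the real mapping data is far larger:

```typescript
// Step 1: normalization — may change the string (and thus the hash).
// Tiny illustrative tables: em/en dash and minus map to "-", FE0F is ignored.
const MAPPED = new Map<number, string>([
  [0x2013, "-"], [0x2014, "-"], [0x2212, "-"],
]);
const IGNORED = new Set<number>([0xfe0f]);

function normalize(name: string): string {
  let out = "";
  for (const ch of name.toLowerCase()) {
    const cp = ch.codePointAt(0)!;
    if (IGNORED.has(cp)) continue;
    out += MAPPED.get(cp) ?? ch;
  }
  return out.normalize("NFC");
}

// Step 2: validation — can only accept or reject, never change the hash.
function validate(name: string): boolean {
  return /^[a-z0-9_.-]+$/.test(name);
}
```

So “a—b.eth” normalizes to “a-b.eth” before hashing, while validation of the raw, unnormalized input would simply reject it.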

Cheers for getting back to me :+1:

Would _ also be mapped to - ??