On-chain ENS Domain Normalization

Here is an initial version of an efficient emoji parser library that should be a drop-in for your contract.

I deployed a contract using it here and populated it.

It requires less than 1 eth in gas to deploy and supports all Unicode 14 emoji. (If the upload mechanism is exposed, it’s also upgradeable, but since it’s so cheap, it’s probably better to seal it and redeploy for future Unicode updates.)

  • function read(Emoji storage self, uint24[] memory cps, uint256 pos, bool skipFE0F) returns (uint256) — will return the length of the next emoji token, if it exists. If skipFE0F is true, it will effectively ignore FE0F (but it will also return a non-zero result for a run of FE0F's without any emoji.)

  • function filter(Emoji storage self, uint24[] memory cps, bool skipFE0F) returns (uint24[]) — will replace emoji tokens with a single 0xFFFFFF.
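To make the read/filter contract concrete, here is a minimal Python sketch of the same idea. It is not the deployed Solidity: the three-entry EMOJI table is a stand-in for the full Unicode 14 set, matching is done against a set rather than the packed state machine, and it does not reproduce the quirk of returning non-zero for a bare run of FE0F.

```python
from typing import List, Set, Tuple

FE0F = 0xFE0F            # variation selector-16
EMOJI_TOKEN = 0xFFFFFF   # placeholder written in place of each emoji token

# Stand-in table (assumption): the real contract stores every Unicode 14 sequence.
EMOJI: Set[Tuple[int, ...]] = {
    (0x1F4A9,),                                   # pile of poo
    (0x1F468,),                                   # man
    (0x1F468, 0x200D, 0x1F469, 0x200D, 0x1F466),  # family: man, woman, boy
}
MAX_LEN = max(len(seq) for seq in EMOJI)

def read(cps: List[int], pos: int, skip_fe0f: bool) -> int:
    """Length in codepoints of the longest emoji token starting at pos, or 0."""
    best = 0
    matched: Tuple[int, ...] = ()
    i = pos
    while i < len(cps):
        cp = cps[i]
        i += 1
        if skip_fe0f and cp == FE0F:
            if matched in EMOJI:
                best = i - pos  # absorb an FE0F that trails a complete emoji
            continue
        matched += (cp,)
        if len(matched) > MAX_LEN:
            break
        if matched in EMOJI:
            best = i - pos
    return best

def filter_emoji(cps: List[int], skip_fe0f: bool) -> List[int]:
    """Replace every emoji token with a single EMOJI_TOKEN."""
    out: List[int] = []
    pos = 0
    while pos < len(cps):
        n = read(cps, pos, skip_fe0f)
        if n:
            out.append(EMOJI_TOKEN)
            pos += n
        else:
            out.append(cps[pos])
            pos += 1
    return out
```

For example, filter_emoji([0x61, 0x1F4A9, 0xFE0F, 0x62], True) collapses the emoji and its trailing FE0F into one token, while passing False leaves the FE0F in the output as an ordinary codepoint.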

If you use it as a library, you’ll need to call upload() with the following 10KB payload, broken into two chunks:

Chunk 1
0x
Chunk 2


The chunks are 6 byte rules for state transitions which follow this tree:


Here is a test contract that uses the above contract as its emoji parser and a BasicValidator contract, which implements /^[a-z0-9_.-]+$/, as its validator.

  • filter(string name, bool ignoreFEOF) should parse any UTF-8 string containing emoji.

  • validate(string name) should validate any normalized name that uses valid emoji or obeys the regex above. Note: this will fail if an emoji includes FE0F (which isn’t normalized). This should validate 94% of registered names as of today.


Outstanding! Do you have an estimate of what it would take to extend this to match your entire normalisation function, not just Emoji?

It depends on how IDNA and NFC are implemented. If you implement them verbatim, it will be very expensive to deploy, as royalfork estimated. However, there’s probably a partial mapping that would give a lot of bang for the buck.

I think there is a possibility of using NFKD → Apply the small number of differences from IDNA 2003 → NFC, but I’d need to verify that this is the same thing as IDNA 2003 → NFD → NFC. You could avoid a whole lookup table with this approach.

For IDNA, we can pack 10 uint24 per storage slot for single char mappings. There are runs of “ABCabc” and “AaBbCc” for casefolding, and runs of “ABC”, “A?B?C?”, and “A??B??C??” that map to a constant. There’s a bunch of ideas in my github.
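As a rough illustration of the packing, ten 24-bit codepoints occupy 240 of the 256 bits in a slot. The Python sketch below shows one possible layout (lane i at bit offset 24*i, which is an assumption) and how a run like "ABC…" → "abc…" could be looked up by offset from the run's first codepoint. The function names are hypothetical.

```python
from typing import List

LANE_BITS = 24
LANES_PER_SLOT = 10  # 10 * 24 = 240 bits, fits in one 256-bit slot

def pack_slot(targets: List[int]) -> int:
    """Pack up to 10 codepoints into one 256-bit integer, lane i at bit 24*i."""
    assert len(targets) <= LANES_PER_SLOT
    slot = 0
    for i, cp in enumerate(targets):
        assert 0 <= cp < (1 << LANE_BITS)
        slot |= cp << (LANE_BITS * i)
    return slot

def unpack_lane(slot: int, i: int) -> int:
    """Read lane i back out of a packed slot."""
    return (slot >> (LANE_BITS * i)) & 0xFFFFFF

def lookup_run(slot: int, base: int, cp: int) -> int:
    """Map cp through a run whose first source codepoint is base."""
    offset = cp - base
    assert 0 <= offset < LANES_PER_SLOT
    return unpack_lane(slot, offset)

# One slot covering the case-folding run A..J -> a..j
slot = pack_slot([ord(c) for c in "abcdefghij"])
```

With this layout a single storage read resolves any of the ten source codepoints in the run.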

If, for the moment, we reject combining marks and composite characters (so NFC isn’t needed) then my emoji parser + an efficient IDNA mapping would be a string → string implementation. All the effort would go into efficiently compressing the mapping.

Otherwise, I think the most important contract is an NFD/NFC implementation.


For validation:

You can naively encode the valid output character set into a bitmap for 700 storage slots. Combine that with the emoji contract above + NFC quick check and that’s a gas efficient function for determining if a name is in the normalized form. I can demo this immediately once I have an NFC implementation.
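A bitmap lookup of this kind can be sketched in Python as follows. The dict stands in for a Solidity mapping of slot index → 256-bit word (missing keys read as zero, as unset storage does); the slot indexing by cp >> 8 and the sample valid set are assumptions for illustration.

```python
from typing import Dict, Iterable

def build_bitmap(valid: Iterable[int]) -> Dict[int, int]:
    """One 256-bit word per block of 256 codepoints; only non-empty blocks are stored."""
    bitmap: Dict[int, int] = {}
    for cp in valid:
        bitmap[cp >> 8] = bitmap.get(cp >> 8, 0) | (1 << (cp & 0xFF))
    return bitmap

def is_valid(bitmap: Dict[int, int], cp: int) -> bool:
    """One word read plus a shift and mask per codepoint."""
    return (bitmap.get(cp >> 8, 0) >> (cp & 0xFF)) & 1 == 1

# Sample valid set (assumption): a-z, 0-9, hyphen
bitmap = build_bitmap(list(range(0x61, 0x7B)) + list(range(0x30, 0x3A)) + [0x2D])
```

Every codepoint in the sample set lands in block 0, so this toy bitmap needs a single word.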

I also think a few separate simple validators for each script would give a lot of coverage (and more could be added over time). However, this approach requires a mechanism for looping through the set of validators to find one which returns true for your name.

As I mentioned in the other thread, you could store the address of the validator on-chain for the name you verified, making all future validations only two SLOADs (node → validator address → is valid validator address), but this requires extra infrastructure.


This seems to be a reasonable starting point. Are there existing domains that use combining marks or composite characters?


5K names have combined characters (NFC != NFD.)
171 names have two combining marks that could be reordered.
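For anyone wanting to reproduce counts like these off-chain, the two conditions can be approximated with Python's unicodedata module. This is a sketch of one plausible reading of the criteria (NFC differs from NFD; adjacent combining marks appear out of canonical order), not the script that produced the numbers above.

```python
import unicodedata

def has_combined_characters(name: str) -> bool:
    """True if composition matters for this name, i.e. NFC != NFD."""
    return unicodedata.normalize("NFC", name) != unicodedata.normalize("NFD", name)

def has_disordered_marks(name: str) -> bool:
    """True if two adjacent combining marks would be swapped by canonical ordering."""
    for a, b in zip(name, name[1:]):
        ccc_a = unicodedata.combining(a)
        ccc_b = unicodedata.combining(b)
        if ccc_a > ccc_b > 0:
            return True
    return False
```

For example, "a" followed by U+0307 (dot above, class 230) and then U+0323 (dot below, class 220) is flagged as disordered, because canonical ordering places the lower class first.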

Codepoints: 2.4K CM and 13K composed.

Maybe that is a good idea, just compress the IDNA part and disallow the codepoints that need NFC. That’s 99% of names unless I’m forgetting something.


@adraffy/ens-norm-research has an improved contract that only needs 360 storage slots and no longer requires the skipFE0F option, along with the code that generates the state payload.


Emoji + IDNA 2003 - (Combining Marks, Compositions) can be done with 10KB of contract code, 128 slots for the valid bitmap, 452 slots for mappings of [1-2 CP], 175 slots for mappings of [3+ CP], and 360 slots for emoji.

[128 + 452 + 175 + 360] → 1115 slots → ~30M + contract gas so under 1 eth to deploy.
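The arithmetic behind that estimate can be checked with a few lines of Python. The per-slot cost uses the standard 20,000 gas for writing a fresh storage slot plus the 2,100 gas cold-access surcharge from EIP-2929; the 30 gwei gas price is purely an assumption for converting gas to ETH, and calldata, transaction overhead, and contract bytecode are not modeled.

```python
SSTORE_SET_GAS = 20_000   # zero -> non-zero storage write
COLD_ACCESS_GAS = 2_100   # EIP-2929 surcharge for first touch of a slot

slots = {"valid": 128, "map_1_2_cp": 452, "map_3_plus_cp": 175, "emoji": 360}
total_slots = sum(slots.values())

# Lower bound for populating storage only
storage_gas = total_slots * (SSTORE_SET_GAS + COLD_ACCESS_GAS)

def gas_to_eth(gas: int, gas_price_gwei: float) -> float:
    """Convert a gas amount to ETH at the given gas price."""
    return gas * gas_price_gwei / 1e9

# The ~30M figure quoted above, at an assumed 30 gwei
eth_at_30_gwei = gas_to_eth(30_000_000, 30)
```

Storage writes alone come to about 24.6M gas, which is consistent with the ~30M figure once upload calldata and overhead are added; at the assumed 30 gwei, 30M gas is 0.9 ETH.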

122810 valid, 4208 mapped, 9 ignored, 3046 emoji sequences.

Since the IDNA table is stable, a huge chunk of the valid and single character mappings can be compiled directly to code.

I think worst-case per character would be ~15 comparisons (x < A), 10 range comparisons (x >= A && x <= B), 1 sload for valid, 1 sload for small mapping, 1 sload for big mapping, and then calling the emoji function (which is 1 sload per codepoint).


Edit: Here is my first working version: https://rinkeby.etherscan.io/address/0xdb68eb6f0ab93bc7059f2752d477b45f04457eb2#readContract

Cost 32M gas to deploy and populate.


Should work for every name except those with combining marks or characters that compose (like à).


Edit 2: I deployed a more gas efficient version that avoids using uint24[]. It also has a beautify() function.

I’m seeing about 50-100K gas per name.

Although I have little idea what you’re on about half the time here, your commitment to see this all the way through is admirable, @raffy :1st_place_medal:


I got a version with NFC Quick Check, which means only disordered combining marks or decomposed characters will incorrectly fail to normalize.

https://rinkeby.etherscan.io/address/0x335be342669ae015d7a87eb7c632447f9218254b
43M gas

🤼🏻‍♂️Ⅷ👩🏽‍⚕️_$🇦ß.۱۲۳👨‍👩‍👦9️⃣.ѐ🏴󠁧󠁢󠁥󠁮󠁧󠁿.eth

How will they be handled? Do they throw an error?

Roughly how much gas does normalize cost?

NFC Quick Check failures throw NotNFC().
Disallowed throw InvalidCodepoint(uint256 cp).
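The same reject-rather-than-repair behavior can be sketched off-chain in Python. The exception names mirror the contract errors above; the is_valid predicate is a hypothetical stand-in for the valid bitmap. Note one difference: unicodedata.is_normalized gives a definitive answer (falling back to full normalization when the quick check is inconclusive), whereas the contract described here relies on the quick check alone.

```python
import unicodedata
from typing import Callable

class NotNFC(Exception):
    """Input is not already in NFC."""

class InvalidCodepoint(Exception):
    """Input contains a disallowed codepoint."""
    def __init__(self, cp: int) -> None:
        super().__init__(f"invalid codepoint 0x{cp:X}")
        self.cp = cp

def require_normalized(name: str, is_valid: Callable[[int], bool]) -> str:
    """Return name unchanged if acceptable, otherwise raise like the contract reverts."""
    if not unicodedata.is_normalized("NFC", name):
        raise NotNFC()
    for ch in name:
        if not is_valid(ord(ch)):
            raise InvalidCodepoint(ord(ch))
    return name
```

With a predicate that rejects only the space character, "à" as the single codepoint U+00E0 passes, "a" followed by U+0300 raises NotNFC, and "a b" raises InvalidCodepoint(0x20).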

  • 🤼🏻‍♂️Ⅷ👩🏽‍⚕️_$🇦ß.۱۲۳👨‍👩‍👦9️⃣.ѐ🏴󠁧󠁢󠁥󠁮󠁧󠁿.eth → 180K gas
  • 💩💩💩.eth → 50K gas
  • RAFFY.ETH → 60K gas
  • maybe-its-time-to-stop-registering-garbage-like-this-and-buy-premiums-from-secondary-instead-such-as-3-or-4-digits-or-clean-one-word-domains-like-soy.eth → 280K gas

An ASCII fast path in front of the emoji checker looked like it would save a lot of gas, but it only saves 1 sload.

Edit: Actually, once both emoji and valid are known, a fast path bitmap can be stored in a single uint256, that returns true for any valid character that isn’t an emoji prefix. At the moment, that’s true for 81 of 0-255. That would make DNS names much cheaper if we need this kind of optimization.

  • raffy.eth → 29K gas
  • maybe-its-time-to-stop-registering-garbage-like-this-and-buy-premiums-from-secondary-instead-such-as-3-or-4-digits-or-clean-one-word-domains-like-soy.eth → 111K gas (save 170K)
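A Python sketch of that single-word fast path follows. The valid set here is a toy (a-z, 0-9, and three punctuation characters) rather than the real table, so it yields fewer than the 81 fast-path codepoints mentioned above; the emoji prefix set reflects the fact that keycap sequences begin with 0-9, # or *.

```python
from typing import Set

def build_fast_path(valid: Set[int], emoji_prefixes: Set[int]) -> int:
    """Bit cp is set when cp < 256 is valid and can never start an emoji."""
    bits = 0
    for cp in range(256):
        if cp in valid and cp not in emoji_prefixes:
            bits |= 1 << cp
    return bits

def is_fast(bits: int, cp: int) -> bool:
    """True if cp can be accepted without consulting the emoji parser or valid bitmap."""
    return cp < 256 and (bits >> cp) & 1 == 1

# Toy inputs (assumption): real sets come from the emoji and IDNA tables.
valid = set(range(0x61, 0x7B)) | set(range(0x30, 0x3A)) | {ord("-"), ord("_"), ord("$")}
emoji_prefixes = set(range(0x30, 0x3A)) | {ord("#"), ord("*")}
fast_bits = build_fast_path(valid, emoji_prefixes)
```

Digits stay off the fast path because a digit may be the start of a keycap emoji, so they still go through the emoji parser.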

Another question is whether the contract ultimately should be adjustable. I’m not sure yet the best way to encode the large tree of codepoint ranges → codepoints (I’m using a ~700 line function.)

This sounds reasonable to me. Is there a reason we would need to implement the full NFC, in this context? Rejecting invalid normalisations seems sufficient to me.

Not bad! We shouldn’t generally have to do these in a transaction, but it’s nice to know it’d be viable if we really had to. Do you think there’s much room for gas optimisation here?

I’d be in favor of immutable contracts that can be replaced as needed; this tends to save gas anyway, and makes verification much easier.

We should have full NFC for when someone inputs decomposed characters.

Now that QC works, I think I’ll just implement the algorithm as-is in my next release, since it will be invoked rarely.

Good to know.

Can we not simply reject such names as invalid?

We’ll have to define “cheap”, but I believe Unicode adds new characters once a year-ish, so if each deployment is >$1k (or operationally complex with multiple transactions), I’d think that easy upgradeability should be a design requirement (ie: new code points can be set to “valid” by a single transaction).

I’d say yes, for the same reason that you’d also reject labels which contain minimally-qualified emoji.


What do we think about the following top-level API?

/**
 * @notice Compute namehash of domain after ENS normalization and
 *         validation.  Reverts if domain contains disallowed 
 *         codepoints, disordered combining marks, or decomposed characters.
 * @param domain Domain to namehash.
 * @return normalized Normalized domain, to be displayed to end-users.
 * @return node Namehash of domain.
 * @return charsets Bit array of unicode character sets present in the domain.
 */
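function namehash(string memory domain) external view returns (string memory normalized, bytes32 node, uint256 charsets);

// Unicode charsets: one bit per charset, in declaration order
uint256 constant ASCII    = 1 << 0;
uint256 constant Latin1   = 1 << 1;
uint256 constant Greek    = 1 << 2;
uint256 constant Cyrillic = 1 << 3;
uint256 constant Emoji    = 1 << 4;
// ...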

This includes a charsets return value, which tells the caller which charsets are present in the domain. I think this would make ENS much safer and more inclusive for international users. For example, English sites could “flag as suspicious” any domain whose charsets include anything beyond ASCII | Emoji, and Chinese sites could flag anything beyond Chinese | ASCII, etc.
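On the consuming side, such a policy reduces to a mask test. The Python sketch below assumes the bit positions follow the declaration order above (ASCII first); the is_suspicious helper is hypothetical and only illustrates how a site might apply its own allowed set.

```python
# Charset bits, assuming declaration order determines the bit position.
ASCII = 1 << 0
LATIN_1 = 1 << 1
GREEK = 1 << 2
CYRILLIC = 1 << 3
EMOJI = 1 << 4

def is_suspicious(charsets: int, allowed: int) -> bool:
    """True if the domain uses any charset outside the site's allowed mask."""
    return (charsets & ~allowed) != 0

# A domain mixing ASCII and Cyrillic letters
mixed = ASCII | CYRILLIC
english_site = is_suspicious(mixed, ASCII | EMOJI)       # flagged
russian_site = is_suspicious(mixed, ASCII | CYRILLIC)    # not flagged
```

The same charsets value leads to different outcomes depending on the caller's mask, which is the point of returning the bit array rather than a single boolean.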

Open Questions:

  • This function doesn’t really play well with the existing “UniversalResolver”. Should this function operate only on labels? Or should it have tighter integration with Universal Resolver? Either way, I think users should have an on-chain function which provides direct string -> (...helpful ENS info...) functionality.
  • I’m not wild about the namehash function name, since this returns more than just the namehash. Any suggestions for a better name?

@raffy That is some really impressive code you have there! Very well done :slight_smile:

Here are some thoughts:

Assume max label length fits in 16 bits (or w/e). The low level normalize should be string → (string norm, string normNoEmoji, uint256[] labelData) where:

  • normNoEmoji is the same string but with each emoji sequence zeroed
    eg. :poop: (4 bytes) → [0,0,0,0]
  • each labelData entry (one per label) is:
    • 16 bits → label length
    • 240 bits → bitset of active non-emoji codepoints shifted right by 14
      (2^21/240 ≈ 8738 codepoints per bit, rounded up to 2^14 => shift by 14 bits)

From that, it’s easy to compute the namehash, extract out any label, compute any label hash, or quickly determine which validation could apply. eg. the basic validator (DNS + Emoji) only runs if the bitset is 0x1. normNoEmoji avoids processing emoji again during validation.
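The labelData word can be sketched in Python as below. Placing the length in the low 16 bits and the bitset above it is an assumption; the proposal only fixes the field widths. With a shift of 14, every codepoint below 0x4000 sets bit 0, so Latin, Greek, and Cyrillic labels all produce a bitset of 0x1.

```python
from typing import Iterable, Tuple

LEN_BITS = 16
SHIFT = 14  # each bitset bit covers 2^14 codepoints

def pack_label_data(length: int, non_emoji_cps: Iterable[int]) -> int:
    """Pack label length (low 16 bits, assumed) and codepoint-range bitset into one word."""
    assert 0 <= length < (1 << LEN_BITS)
    bitset = 0
    for cp in non_emoji_cps:
        bitset |= 1 << (cp >> SHIFT)
    assert bitset < (1 << 240)
    return (bitset << LEN_BITS) | length

def unpack_label_data(data: int) -> Tuple[int, int]:
    """Return (length, bitset)."""
    return data & ((1 << LEN_BITS) - 1), data >> LEN_BITS

ascii_label = pack_label_data(5, [ord(c) for c in "raffy"])
cjk_label = pack_label_data(6, [0x4E2D, 0x6587])  # two CJK codepoints, 6 UTF-8 bytes
```

Here unpack_label_data(ascii_label) yields a bitset of 0x1, while the CJK label sets bit 1 instead, so a validator that only handles codepoints below 0x4000 can skip it after one comparison.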

The primary normalize function should be string → (string, hash) like you describe.


I think all the charset stuff should exist in the validator, whether that’s the same contract or a different contract, I’m not sure. Most validator checks are per-label, so given (label, bitset), you can efficiently check if it’s valid. IIRC, only the bidi check requires a full-name check.

The low level validation function would be (string, bitset) → bool like:

function validate(string memory label, uint256 bitset) view returns (bool) {
     // any 0 byte is a previously processed emoji
     if (bitset == 1) { // only codepoints [0, 0x4000)
         // check if basic
         // check if latin non-confusable
         // check if greek w/o wholescript confusable
         // etc...
     }
     return false;
}

Or more cleanly, the validation contract could have arbitrary functions keyed by bitset + nonce, so you could register/unregister any number of validation functions, and quickly check which ones could apply by intersecting the label bitset, etc.


The primary normalize + validation function would be string → (string norm, uint256 hash, bool valid) where:

given name
(norm, normNoEmoji, labelData) = normalize(name)
end = norm.length
hash = 0
valid = true
// possibly apply full-name check
for [len, bitset] of labelData.reverse()    // namehash folds labels right-to-left
    start = end - len
    hash = keccak(hash + keccak(norm.slice(start, end)))
    valid &= validate(normNoEmoji.slice(start, end), bitset)
    end = start - 1                         // step over the "." separator
return (norm, hash, valid)

  • If this throws, the name is invalid
  • If valid is false, the user should get some kind of warning that the name is potentially unsafe (where unsafe means one or more labels satisfied 0 approved validators.)
  • If valid isn’t needed, use the primary normalize function instead.

I haven’t thought much about what the internal API should be, but this certainly looks reasonable. Just to clarify, is the motivation behind the emoji special case (norm vs normNoEmoji) to sort out the 0xfe0f issue? (ie: 0xfe0f must be present for “input/display normalization”, but be absent for namehash calculation?)

Would this also be part of the internal API? I don’t think the nuance around valid should be exposed to end-users.

Why wouldn’t you also throw in this case?


I think the first step is to build general consensus around a top-level public API to be used by ecosystem developers. Once we have that, I’m pretty confident we can work backwards to nail down the low-level internals.

Validation only needs to check if the non-emoji part of the string obeys all the Unicode rules. But you need to run the full emoji logic to skip over them, so each validator would then need efficient access to the emoji logic. Whereas, allocating a little more memory and keeping a second copy with zeroed emoji bytes means each validator can just skip over 0’s and be fully ignorant of emoji.

This lets you construct the basic validator /[a-z0-9_-]+/ (and many others) very easily.
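A Python sketch of such a validator over the zeroed copy is below. It takes the label's UTF-8 bytes with every emoji already replaced by zero bytes, drops the zeros, and applies the regex to what remains. Treating an all-emoji label as valid and an empty label as invalid are assumptions.

```python
import re

BASIC = re.compile(rb"[a-z0-9_-]*")

def basic_validate(label_no_emoji: bytes) -> bool:
    """Validate a label whose emoji bytes were zeroed by the normalizer."""
    if not label_no_emoji:
        return False  # assumption: empty labels are rejected
    # Zero bytes mark emoji that the parser already accepted; ignore them.
    remainder = label_no_emoji.replace(b"\x00", b"")
    return BASIC.fullmatch(remainder) is not None

# "ra" + one 4-byte emoji (zeroed) + "ffy"
with_emoji = b"ra\x00\x00\x00\x00ffy"
```

The validator never needs the emoji tables: basic_validate(with_emoji) passes, while a label containing a Cyrillic letter fails because its UTF-8 bytes fall outside the character class.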

This is my opinion but I think normalizable-but-confusing names should still work. Especially for the headless situation, eg. aрe.eth confusable can normalize and hash, but valid would be false.

There exist creative/vanity names that will never satisfy even the most relaxed script-based rules simply because they use a single-character of a script that’s confusable in other contexts.

Validity depends on context. Within an English-speaking context, aрe.eth should be flagged, but within a Russian-speaking context, something like овечкин8.eth shouldn’t be flagged. Since they both contain ASCII+Cyrillic, I don’t think a simple yes/no validity boolean is enough to capture this distinction.

I don’t think we should dictate what combinations of character sets are or aren’t valid. Instead, we should provide an easy way for ecosystem developers to know what’s actually inside an input domain, and provide simple building blocks to construct custom validity rules/guidelines. I believe the (uint256 charsets) return value provides this.

Taking your example, the English-speaking ecosystem developer would only need to write valid = (charset == ASCII) to flag aрe.eth as invalid.

I just wanted to check in and see if on-chain normalization is being worked on, or if there are any status updates as the last reply to this thread was in July.

I wrote a Solidity Unicode NF implementation so we have all of the parts necessary.

I just need to integrate it into the previous contract and upload the payloads according to the latest spec and we should be 100% match.


Edit: This is really hacky and untested but appears to work just by gluing some pieces together. It follows the latest ENSIP rules. It has underscore, label extension (double hyphen), and combining mark (leading, emoji, adjacent) logic. Beautifier works too (missing the regional indicator separator logic.)
https://rinkeby.etherscan.io/address/0x1E321B0fdbbc022D959F5D9eB829f071cC375131#readContract

  • normalize() gives array of labels (unfortunately if you copy from etherscan periods will be commas)
  • normhash() goes straight from unnormalized to namehash

Deployment gas was 70M, up from 43M. Needs optimization but “RAFFY.eth” was under 70K gas.
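For readers unfamiliar with the label rules mentioned above, here is a hedged Python sketch of three of them: underscore only at the start of a label, no label extension (a double hyphen in the 3rd and 4th positions), and no leading combining mark. The exact conditions are assumptions based on a reading of the ENSIP draft: the hyphen rule is applied only to all-ASCII labels, general category M* approximates the combining mark set, and the emoji and adjacent combining mark cases are not covered.

```python
import unicodedata

class LabelError(ValueError):
    """Raised when a label breaks one of the sketched rules."""

def check_label(label: str) -> None:
    """Raise LabelError if the (already normalized) label violates a rule."""
    # Underscores may only appear as a leading run.
    if "_" in label.lstrip("_"):
        raise LabelError("underscore allowed only at start")
    # Label extension: "--" in positions 3-4, checked for ASCII labels (assumption).
    if label.isascii() and label[2:4] == "--":
        raise LabelError("invalid label extension")
    # A label must not begin with a combining mark (category M* as approximation).
    if label and unicodedata.category(label[0]).startswith("M"):
        raise LabelError("leading combining mark")
```

So "__raffy" and "a-b" pass, while "ra_ffy", "xn--abc", and a label starting with U+0300 are rejected.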


There’s still the question of what the API should be.


That’s absolutely beautiful @raffy, thanks for the update and for your hard work! :slight_smile:
