On-chain ENS Domain Normalization

Way back in December, I made the following comment in our now infamous ENS Name Normalization thread:

After taking some time to look into it, I believe that an onchain implementation of “ENS Name Normalization” (as proposed by @raffy in Draft ENSIP - Standardization of ENS Name Normalization) is both technically feasible and economically viable within a mainnet contract.

As a small proof of concept, I built github.com/royalfork/ens-ascii-normalizer, which you can try out on etherscan. ENSNormalizeAscii contains the following functions:

  • namehash(string domain) returns (string, bytes32) returns the normalized domain, and the namehash node of the normalized domain (reverting if domain is invalid).
  • owner(string domain) returns (address) uses namehash to return the domain’s current ENS owner.

Note: This contract only supports ASCII-only domains. If a domain contains non-ASCII characters, namehash/owner will revert.
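To make the ASCII-only behaviour concrete, here is a minimal off-chain sketch in Python (not the contract’s Solidity code). The allowed character set (lowercase letters, digits, hyphen) and the function name normalize_ascii are assumptions for illustration; the deployed contract may accept a different set. A raised ValueError stands in for an on-chain revert.

```python
import string

# Assumption: lowercase letters, digits and hyphen are the only valid
# label characters. The deployed contract may allow a different set.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")


def normalize_ascii(domain: str) -> str:
    """Lowercase an ASCII domain, raising ValueError where a contract would revert."""
    labels = []
    for label in domain.split("."):
        if not label:
            raise ValueError("empty label")
        out = []
        for ch in label:
            if ord(ch) > 0x7F:
                raise ValueError(f"non-ASCII character U+{ord(ch):04X}")
            ch = ch.lower()  # simple 1 -> 1 ASCII case mapping
            if ch not in ALLOWED:
                raise ValueError(f"disallowed character {ch!r}")
            out.append(ch)
        labels.append("".join(out))
    return ".".join(labels)


print(normalize_ascii("RoyalFork.ETH"))  # royalfork.eth
```

The namehash step is left out because Python’s standard library has no keccak256 (hashlib.sha3_256 is a different function).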

Moving to full unicode support:

  • utf8 encoding/decoding incurs additional complexity and computing cost over just ASCII domains. In preliminary tests, validation of a simple 6 letter unicode domain cost about 200000 gas (~$5), which is probably too expensive to be called by other contracts (although this number is based on unoptimized solidity code, and could be reduced).
  • IDNA2008 has ~1 million “unicode” rules which would all need to be supported. These rules can be compressed (run length encoding) into a set of ~2000 “rule import” eth transactions, each costing ~1,500,000 gas. This puts the total cost of IDNA2008 rule importation at around $85,000 (at 30 gwei gas).
  • Emoji validation requires importation of another ~3000 rules (based on saving the full RGI set on-chain, but it might be possible to do this with less). Not sure on the best data structures for this part yet, but would estimate that emoji validation data import would cost another $20,000-$50,000 in gas (I can try and firm up a better estimate if needed).
  • Most UTS-46 mapping would be encoded into the IDNA2008 rule set, but “1 char → n char” mappings would require separate importation. There are only a few of these, but this could add another ~$10,000 to deployment costs (also, implementing unicode 1->n mappings isn’t as simple as the ascii 1->1 mappings, some inline assembly is required to make those mappings memory efficient).
  • Once on-chain, the rules would be customizable, so an “owner” could allow/disallow certain characters and/or change how context-based validation works (emoji validation is implemented via context-based validation).
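The gas figures above convert to ETH as follows. This is only the arithmetic, sketched in Python; the dollar amounts additionally depend on an ETH price, which the post does not state, so eth_usd is a caller-supplied assumption here, and the function names are made up for illustration.

```python
def tx_cost_eth(gas: int, gas_price_gwei: float) -> float:
    """ETH spent for `gas` units at `gas_price_gwei` (1 ETH = 1e9 gwei)."""
    return gas * gas_price_gwei / 1e9


def tx_cost_usd(gas: int, gas_price_gwei: float, eth_usd: float) -> float:
    """USD cost; eth_usd is an assumed price, not a figure from the thread."""
    return tx_cost_eth(gas, gas_price_gwei) * eth_usd


# ~2000 rule-import transactions at ~1,500,000 gas each, 30 gwei gas price
rule_import_gas = 2000 * 1_500_000
print(tx_cost_eth(rule_import_gas, 30))  # 90.0
```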

Not sure if ENSAsciiNormalizer is the first mainnet “string → owner” contract for ENS, but I find it nice to perform direct on-chain domain lookup/validation without the need for intermediary client-side code (it’s also nice to check for ZWJ characters directly on etherscan). If there’s any interest in a “full unicode normalization” contract (and we can collectively stomach the ~$200,000 in fees it would take us to get there), I can put together a more detailed RFP. Would be pretty cool to implement something like this, and it could vastly simplify 3rd party integrations (instead of asking etherscan, metamask, opensea, ethers.js, everyone else to “update their code” every time ENS normalization rules are updated, we can just point them to our easy-to-use on-chain normalization function and be done with it).


Onchain normalisation would be really useful to have; it would mean that implementations can start by using it, and transition to doing it in the client library for efficiency later, if desired.

Gas efficiency for normalisation needn’t be the top concern; normally there’d be no reason to do the normalisation inside a transaction - instead, clients can call the normalisation function to get the namehash, and pass that in to the transaction.

The costs to upload the tables will be substantial, but I think that with the preprocessing and compression work @raffy has done, it’s probably viable for a much smaller sum.


I extended this into a proof-of-concept for emoji+ascii normalization, and deployed on ropsten here: https://ropsten.etherscan.io/address/0x31096216008a6bc55ba6434488d14c86bdcf4283#code.

This contract maintains a list of all canonical/allowed emoji sequences, and only allows emoji within that set. ZWJ characters are only allowed within emoji sequences, and are only allowed in those emoji sequences that specifically use them, where they must match the allowed sequence verbatim. There’s also some special handling for 0xFE0F such that only fully-qualified emoji are returned as “normalized”, but the minimally-qualified, without 0xFE0F, is used to compute the namehash/labelhash (meaning this is fully backwards-compatible with existing emoji registrations).
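A small Python sketch of the 0xFE0F handling described above: the allowlist stores fully-qualified sequences, the display form keeps 0xFE0F, and the hashing form drops it. The two allowlist entries mirror the two emoji mentioned for this deployment; the data layout and the name emoji_forms are assumptions for illustration, not the contract’s storage format.

```python
FE0F = 0xFE0F

# Fully-qualified sequences as codepoint tuples (illustrative allowlist).
ALLOWED = [
    (0x37, 0xFE0F, 0x20E3),                       # keycap seven
    (0x1F46E, 0x1F3FB, 0x200D, 0x2640, 0xFE0F),   # woman police officer, light skin tone
]


def strip_fe0f(cps):
    return tuple(cp for cp in cps if cp != FE0F)


# Index the allowlist by its FE0F-free form so both spellings resolve.
BY_HASH_FORM = {strip_fe0f(seq): seq for seq in ALLOWED}


def emoji_forms(cps):
    """Return (display_form, hash_form) for one emoji sequence, or raise."""
    cps = tuple(cps)
    hash_form = strip_fe0f(cps)
    display = BY_HASH_FORM.get(hash_form)
    # Accept only the exact fully-qualified or exact FE0F-free spelling.
    if display is None or cps not in (display, hash_form):
        raise ValueError("emoji sequence not in allowlist")
    return display, hash_form
```

Either spelling of :seven: yields the same pair, so the displayed name is always fully-qualified while the hashed bytes match existing registrations.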

This PoC implementation wastes a lot of storage (with expensive “nested maps”), but even in its unoptimized state, saving a large, 5-char-long emoji sequence into the allowlist costs 140k gas (https://ropsten.etherscan.io/tx/0x9ecf9c9665f415e5c97511489efa5185ab3d82ec0c62b8f7a88fa34023462195), which is about $13 on mainnet (30gwei). Saving 3000 RGI emoji would cost worst case $40k, but most likely a lot less (@raffy has some ideas on how this can be substantially improved).

Note: For this deployment, I only added :seven: and :policewoman:t2: to the emoji allowlist…all other emoji will revert if you try and namehash, because they aren’t in the allowlist.


@royalfork - Awesome work. Thanks for taking the time to share the concept and PoC it.

@slobo.eth - Could we get this on the agenda for this week’s Ecosystem WG meeting? It’d be great to discuss this synchronously.
We’re back to 7pm GMT on Monday this week, correct?


Here is an initial version of an efficient emoji parser library that should be a drop-in for your contract.

I deployed a contract using it here and populated it.

It requires less than 1 eth in gas to deploy and supports all Unicode 14 emoji. (If the upload mechanism is exposed, it’s also upgradeable, but since it’s so cheap, it’s probably better to seal it and redeploy for future Unicode updates.)

  • function read(Emoji storage self, uint24[] memory cps, uint256 pos, bool skipFE0F) returns (uint256) — will return the length of the next emoji token, if it exists. If skipFE0F is true, it will effectively ignore FE0F (but it will also return a non-zero result for a run of FE0F's without any emoji.)

  • function filter(Emoji storage self, uint24[] memory cps, bool skipFE0F) returns (uint24[]) — will replace emoji tokens with a single 0xFFFFFF.
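For readers who want to see the shape of read and filter, here is a rough Python sketch that walks a trie of codepoints. It is not the library’s actual state-machine encoding: the dict-based trie, build_trie, and the omission of the skipFE0F option are simplifications assumed for illustration.

```python
EMOJI_TOKEN = 0xFFFFFF  # placeholder that replaces each emoji sequence


def build_trie(sequences):
    """Nested dicts keyed by codepoint; the key None marks a complete sequence."""
    root = {}
    for seq in sequences:
        node = root
        for cp in seq:
            node = node.setdefault(cp, {})
        node[None] = True
    return root


def read(trie, cps, pos):
    """Length of the longest emoji sequence starting at cps[pos], or 0."""
    node, best = trie, 0
    for i in range(pos, len(cps)):
        node = node.get(cps[i])
        if node is None:
            break
        if None in node:
            best = i - pos + 1
    return best


def filter_emoji(trie, cps):
    """Replace every emoji sequence with a single EMOJI_TOKEN."""
    out, pos = [], 0
    while pos < len(cps):
        n = read(trie, cps, pos)
        if n:
            out.append(EMOJI_TOKEN)
            pos += n
        else:
            out.append(cps[pos])
            pos += 1
    return out
```

Taking the longest match matters for ZWJ sequences, where a shorter emoji is a prefix of a longer one.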

If you use as a library, you’ll need to call upload() with the following 10KB payload, broken into (2) chunks:

Chunk 1

Chunk 2


The chunks are 6 byte rules for state transitions which follow this tree:


Here is a test contract that uses the above contract as the emoji parser and a BasicValidator contract, which implements /^[a-z0-9_.-]+$/, as the validator.

  • filter(string name, bool ignoreFEOF) should parse any UTF8 string with emojis.

  • validate(string name) should validate any normalized name which uses valid emoji or obeys the regex above. Note: This will fail if emoji have FE0F (which isn’t normalized). This should validate 94% of registered names as-of today.


Outstanding! Do you have an estimate of what it would take to extend this to match your entire normalisation function, not just Emoji?

It depends on how IDNA and NFC are implemented. If you implement them verbatim, it will be very expensive to deploy like royalfork estimated. However, there’s probably a partial mapping that would give a lot of bang for the buck.

I think there is a possibility of using NFKD → Apply the small number of differences from IDNA 2003 → NFC, but I’d need to verify that this is the same thing as IDNA 2003 → NFD → NFC. You could avoid a whole lookup table with this approach.

For IDNA, we can pack 10 uint24 per storage slot for single char mappings. There are runs of “ABCabc” and “AaBbCc” for casefolding, and runs of “ABC”, “A?B?C?”, and “A??B??C??” that map to a constant. There’s a bunch of ideas in my github.
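The “10 uint24 per storage slot” packing works because 10 × 24 = 240 bits fits in a 256-bit word. A quick Python sketch of the bit layout (entry i at bit offset 24·i is an assumed ordering, and the function names are illustrative):

```python
MAX_PER_SLOT = 10  # 10 * 24 = 240 bits, which fits in one 256-bit slot


def pack_uint24(cps):
    """Pack up to 10 codepoints into one integer, entry i at bit 24*i."""
    if len(cps) > MAX_PER_SLOT:
        raise ValueError("at most 10 values per slot")
    word = 0
    for i, cp in enumerate(cps):
        if not 0 <= cp < (1 << 24):
            raise ValueError("value does not fit in 24 bits")
        word |= cp << (24 * i)
    return word


def unpack_uint24(word, count=MAX_PER_SLOT):
    """Inverse of pack_uint24; unused entries come back as 0."""
    return [(word >> (24 * i)) & 0xFFFFFF for i in range(count)]
```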

If, for the moment, we reject combining marks and composite characters (so NFC isn’t needed) then my emoji parser + an efficient IDNA mapping would be a string → string implementation. All the effort would go into efficiently compressing the mapping.

Otherwise, I think the most important contract is an NFD/NFC implementation.


For validation:

You can naively encode the valid output character set into a bitmap for 700 storage slots. Combine that with the emoji contract above + NFC quick check and that’s a gas efficient function for determining if a name is in the normalized form. I can demo this immediately once I have an NFC implementation.
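As a sketch of the bitmap idea: each 256-bit slot holds the valid flag for 256 consecutive codepoints, so 700 slots cover 700 × 256 = 179,200 codepoints. The Python class below models storage as a dict of slot index to integer; the class name and the slot indexing (cp >> 8, cp & 0xFF) are assumptions for illustration.

```python
class ValidBitmap:
    """One bit per codepoint: slot index is cp >> 8, bit index is cp & 0xFF."""

    def __init__(self):
        self.slots = {}  # slot index -> 256-bit integer (models contract storage)

    def set_valid(self, cp: int) -> None:
        slot = cp >> 8
        self.slots[slot] = self.slots.get(slot, 0) | (1 << (cp & 0xFF))

    def is_valid(self, cp: int) -> bool:
        # One "sload" per lookup: fetch the slot, then test a single bit.
        return (self.slots.get(cp >> 8, 0) >> (cp & 0xFF)) & 1 == 1
```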

I also think a few separate simple validators for each script would give a lot of coverage (and more could be added over time). However, this approach requires a mechanism for looping through the set of validators to find one which returns true for your name.

I mentioned in the other thread, you could store the address of validator on-chain for the name you verified, making all future validations only two SLOADs, node → validator address → is valid validator address, but this requires extra infrastructure.


This seems to be a reasonable starting point. Are there existing domains that use combining marks or composite characters?


5K names have combined characters (NFC != NFD.)
171 names have two combining marks that could be reordered.

Codepoints: 2.4K CM and 13K composed.

Maybe that is a good idea, just compress the IDNA part and disallow the codepoints that need NFC. That’s 99% of names unless I’m forgetting something.


@adraffy/ens-norm-research has an improved contract that only needs 360 storage slots and no longer requires the skipFE0F option, as well as the code that generates the state payload.


Emoji + IDNA 2003 - (Combining Marks, Compositions) can be done with 10KB contract code, 128 slots for Valid bitmap, 452 slots for Mappings of [1-2 CP], 175 slots for Mappings [3+ CP], 360 slots for Emoji.

[128 + 452 + 175 + 360] → 1115 slots → ~30M + contract gas so under 1 eth to deploy.
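The slot arithmetic can be checked with a few lines of Python. The per-slot cost below is an assumption: 20,000 gas for writing a fresh storage slot plus 2,100 for the cold access, which gives a floor on storage cost only (calldata and execution overhead come on top, which is presumably where the ~30M estimate comes from). The 30 gwei gas price is likewise assumed.

```python
SLOTS = {
    "valid bitmap": 128,
    "mappings, 1-2 codepoints": 452,
    "mappings, 3+ codepoints": 175,
    "emoji": 360,
}

# Assumption: 20,000 gas per new storage slot + 2,100 cold-access surcharge.
GAS_PER_NEW_SLOT = 20_000 + 2_100

total_slots = sum(SLOTS.values())
storage_gas_floor = total_slots * GAS_PER_NEW_SLOT

print(total_slots)              # 1115
print(storage_gas_floor)        # 24641500
print(30_000_000 * 30 / 1e9)    # 0.9 (ETH for 30M gas at an assumed 30 gwei)
```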

122810 valid, 4208 mapped, 9 ignored, 3046 emoji sequences.

Since the IDNA table is stable, a huge chunk of the valid and single character mappings can be compiled directly to code.

I think the worst case per character would be ~15 comparisons (x < A), 10 range comparisons (x >= A && x <= B), 1 sload for valid, 1 sload for small mapping, 1 sload for big mapping, and then calling the emoji function (1 sload per codepoint).


Edit: Here is my first working version: https://rinkeby.etherscan.io/address/0xdb68eb6f0ab93bc7059f2752d477b45f04457eb2#readContract

Cost 32M gas to deploy and populate.


Should work for every name except those with combining marks or characters that compose (like à).


Edit 2: I deployed a more gas efficient version that avoids using uint24[]. It also has a beautify() function.

I’m seeing about 50-100K gas per name.

Although I have little idea what you’re on about half the time here, your commitment to see this all the way through is admirable, @raffy :1st_place_medal:


I got a version with NFC Quick Check, which means only disordered combining marks or decomposed characters will incorrectly fail to normalize.

https://rinkeby.etherscan.io/address/0x335be342669ae015d7a87eb7c632447f9218254b
43M gas

🤼🏻‍♂️Ⅷ👩🏽‍⚕️_$🇦ß.۱۲۳👨‍👩‍👦9️⃣.ѐ🏴󠁧󠁢󠁥󠁮󠁧󠁿.eth

How will they be handled? Do they throw an error?

Roughly how much gas does normalize cost?

NFC Quick Check failures throw NotNFC().
Disallowed throw InvalidCodepoint(uint256 cp).
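For comparison, the same “reject anything that isn’t already NFC” behaviour can be reproduced off-chain with Python’s standard library. Note that unicodedata.is_normalized (Python 3.8+) gives a definitive answer, so this sketch is not the contract’s quick-check tables; the NotNFC exception class only mirrors the error name above.

```python
import unicodedata


class NotNFC(ValueError):
    """Mirrors the contract's NotNFC() error in this off-chain sketch."""


def require_nfc(name: str) -> str:
    """Return name unchanged if it is already in NFC, otherwise raise NotNFC."""
    if not unicodedata.is_normalized("NFC", name):
        raise NotNFC(name)
    return name


require_nfc("\u00e0.eth")       # precomposed à passes
try:
    require_nfc("a\u0300.eth")  # "a" + combining grave accent is decomposed
except NotNFC:
    print("rejected decomposed input")
```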

  • 🤼🏻‍♂️Ⅷ👩🏽‍⚕️_$🇦ß.۱۲۳👨‍👩‍👦9️⃣.ѐ🏴󠁧󠁢󠁥󠁮󠁧󠁿.eth → 180K gas
  • 💩💩💩.eth → 50K gas
  • RAFFY.ETH → 60K gas
  • maybe-its-time-to-stop-registering-garbage-like-this-and-buy-premiums-from-secondary-instead-such-as-3-or-4-digits-or-clean-one-word-domains-like-soy.eth → 280K gas

I first thought an ASCII fast path in front of the emoji checker would save a lot of gas, but it only saves 1 sload.

Edit: Actually, once both emoji and valid are known, a fast path bitmap can be stored in a single uint256, that returns true for any valid character that isn’t an emoji prefix. At the moment, that’s true for 81 of 0-255. That would make DNS names much cheaper if we need this kind of optimization.
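A sketch of that single-word fast path in Python: one 256-bit mask with a bit set for every low codepoint that is valid and cannot start an emoji. The character set used to build the mask is a hypothetical subset (the real mask would be derived from the valid and emoji tables); digits are left out because keycap sequences such as 9️⃣ begin with a digit.

```python
def build_mask(chars: str) -> int:
    """Set bit ord(ch) for each character; all must be below 256."""
    mask = 0
    for ch in chars:
        if ord(ch) >= 256:
            raise ValueError("fast path only covers codepoints 0-255")
        mask |= 1 << ord(ch)
    return mask


# Hypothetical subset: letters, hyphen, underscore and the label separator.
FAST_MASK = build_mask("abcdefghijklmnopqrstuvwxyz-_.")


def is_fast(cp: int) -> bool:
    return cp < 256 and (FAST_MASK >> cp) & 1 == 1


def all_fast(name: str) -> bool:
    """True if every character can skip the emoji and storage lookups."""
    return all(is_fast(ord(ch)) for ch in name)
```

Names that fail all_fast are not invalid; they simply fall through to the slower emoji and mapping path.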

  • raffy.eth → 29K gas
  • maybe-its-time-to-stop-registering-garbage-like-this-and-buy-premiums-from-secondary-instead-such-as-3-or-4-digits-or-clean-one-word-domains-like-soy.eth → 111K gas (save 170K)

Another question is whether the contract ultimately should be adjustable. I’m not yet sure of the best way to encode the large tree of codepoint ranges → codepoints (I’m using a ~700 line function.)

This sounds reasonable to me. Is there a reason we would need to implement the full NFC, in this context? Rejecting invalid normalisations seems sufficient to me.

Not bad! We shouldn’t generally have to do these in a transaction, but it’s nice to know it’d be viable if we really had to. Do you think there’s much room for gas optimisation here?

I’d be in favor of immutable contracts that can be replaced as needed; this tends to save gas anyway, and makes verification much easier.

We should have full NFC for when someone inputs decomposed characters.

Now that QC works, I think I’ll just implement the algorithm as-is in my next release, since it will be invoked rarely.

Good to know.

Can we not simply reject such names as invalid?

We’ll have to define “cheap”, but I believe Unicode adds new characters once a year-ish, so if each deployment is >$1k (or operationally complex with multiple transactions), I’d think that easy upgradeability should be a design requirement (ie: new code points can be set to “valid” by a single transaction).

I’d say yes, for the same reason that you’d also reject labels which contain minimally-qualified emoji.


What do we think about the following top-level API?

/**
 * @notice Compute namehash of domain after ENS normalization and
 *         validation.  Reverts if domain contains disallowed 
 *         codepoints, disordered combining marks, or decomposed characters.
 * @param domain Domain to namehash.
 * @return normalized Normalized domain, to be displayed to end-users.
 * @return node Namehash of domain.
 * @return charsets Bit array of unicode character sets present in the domain.
 */
function namehash(string memory domain) public view returns (string memory normalized, bytes32 node, uint256 charsets);

// Unicode charsets (bit flags, Go-style pseudocode)
const (
	ASCII = 1 << iota
	Latin1
	Greek
	Cyrillic
	Emoji
	...
)

This includes a charsets return value, which tells the caller which charsets are present in the domain. I think this would make ENS much safer and more inclusive for international users. For example, English sites could “flag as suspicious” any domain whose charsets include anything outside ASCII | Emoji, and Chinese sites could flag anything outside Chinese | ASCII, etc.
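On the caller side, the check is a single mask operation. A Python sketch, with bit positions following the flag order listed above (the exact bit assignments and the name is_suspicious are assumptions):

```python
# Bit flags in the order listed above (assumed positions).
ASCII = 1 << 0
LATIN1 = 1 << 1
GREEK = 1 << 2
CYRILLIC = 1 << 3
EMOJI = 1 << 4


def is_suspicious(charsets: int, allowed: int) -> bool:
    """True if the name uses any charset outside the site's allowed set."""
    return (charsets & ~allowed) != 0


english_site = ASCII | EMOJI
print(is_suspicious(ASCII | EMOJI, english_site))     # False
print(is_suspicious(ASCII | CYRILLIC, english_site))  # True
```

Combining flags with | and testing with & ~allowed also accepts names that use only a subset of the allowed charsets, which a plain equality check would flag.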

Open Questions:

  • This function doesn’t really play well with the existing “UniversalResolver”. Should this function operate only on labels? Or should it have tighter integration with Universal Resolver? Either way, I think users should have an on-chain function which provides direct string -> (...helpful ENS info...) functionality.
  • I’m not wild about the namehash function name, since this returns more than just the namehash. Any suggestions for a better name?

@raffy That is some really impressive code you have there! Very well done :slight_smile:

Here are some thoughts:

Assume max label length fits in 16 bits (or w/e). The low level normalize should be string → (string norm, string normNoEmoji, uint256[] labelData) where:

  • normNoEmoji is the same string but each emoji sequence is zero’d
    eg. :poop: (4 bytes) → [0,0,0,0]
  • each labelData is:
    • 16 bit → label length
    • 240 bits → bitset of active non-emoji codepoints shifted by 14
      (2^21/240 => 14 bits)

From that, it’s easy to compute the namehash, extract out any label, compute any label hash, or quickly determine which validation could apply. eg. the basic validator (DNS + Emoji) only runs if the bitset is 0x1. normNoEmoji avoids processing emoji again during validation.
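Here is how one labelData word could be assembled, sketched in Python. The layout (length in the low 16 bits, bitset above it) and the use of UTF-8 byte length are assumptions; the description only fixes the field widths and the shift of 14. Input is a normNoEmoji label, so NUL characters stand for already-processed emoji bytes and are skipped.

```python
SHIFT = 14  # each bitset bit covers a block of 2**14 = 0x4000 codepoints


def pack_label_data(label_no_emoji: str) -> int:
    """Pack UTF-8 byte length (low 16 bits) and codepoint-block bitset (above)."""
    length = len(label_no_emoji.encode("utf-8"))
    if length >= 1 << 16:
        raise ValueError("label too long for a 16-bit length")
    bitset = 0
    for ch in label_no_emoji:
        if ch == "\0":  # zeroed emoji byte, ignored by validators
            continue
        bitset |= 1 << (ord(ch) >> SHIFT)
    return (bitset << 16) | length


def unpack_label_data(word: int):
    """Return (length, bitset)."""
    return word & 0xFFFF, word >> 16
```

With this layout, a label made only of codepoints below 0x4000 has bitset 0x1, which is the condition under which the basic validator applies.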

The primary normalize function should be string → (string, hash) like you describe.


I think all the charset stuff should exist in the validator, whether that’s the same contract or a different contract, I’m not sure. Most validator checks are per-label, so given (label, bitset), you can efficiently check if its valid. IIRC, only check bidi requires a full-name check.

The low level validation function would be (string, bitset) → bool like:

function validate(string memory label, uint256 bitset) public view returns (bool) {
     // any 0 byte is a previously processed emoji
     if (bitset == 1) { // only codepoints [0, 0x4000)
         // check if basic
         // check if latin non-confusable
         // check if greek w/o wholescript confusable
         // etc...
     }
     return false;
}

Or more cleanly, the validation contract could have arbitrary functions key’d by bitset + nonce, so you could register/unregister any number of validation functions, and quickly check which ones could apply by intersecting the label bitset, etc.


The primary normalize + validation function would be string → (string norm, uint256 hash, bool valid) where:

given name
(norm, normNoEmoji, labelData) = normalize(name)
end = norm.length
hash = 0
valid = true
// possibly apply full-name check
// namehash folds labels right to left, so walk offsets back from the end
for [len, bitset] of labelData.reverse()
    start = end - len
    hash = keccak(hash + keccak(norm.slice(start, end)))
    valid &= validate(normNoEmoji.slice(start, end), bitset)
    end = start - 1 // skip the "." separator
return (norm, hash, valid)
  • If this throws, the name is invalid
  • If valid is false, the user should get some kind of warning that the name is potentially unsafe (where unsafe means one or more labels satisfied 0 approved validators.)
  • If valid isn’t needed, use the primary normalize function instead.

I haven’t thought much about what the internal API should be, but this certainly looks reasonable. Just to clarify, is the motivation behind the emoji special case (norm vs normNoEmoji) to sort out the 0xfe0f issue? (ie: 0xfe0f must be present for “input/display normalization”, but be absent for namehash calculation?)

Would this also be part of the internal API? I don’t think the nuance around valid should be exposed to end-users.

Why wouldn’t you also throw in this case?


I think the first step is to build general consensus around a top-level public API to be used by ecosystem developers. Once we have that, I’m pretty confident we can work backwards to nail down the low-level internals.

Validation only needs to check if the non-emoji part of the string obeys all the Unicode rules. But you need to run the full emoji logic to skip over them, so each validator would then need efficient access to the emoji logic. Whereas, allocating a little more memory and keeping a second copy with zero’d emoji bytes means each validator can just skip over 0’s and be fully ignorant of emoji.

This lets you construct the basic validator /[a-z0-9_-]+/ (and many others) very easily.

This is my opinion but I think normalizable-but-confusing names should still work. Especially for the headless situation, eg. aрe.eth confusable can normalize and hash, but valid would be false.

There exist creative/vanity names that will never satisfy even the most relaxed script-based rules simply because they use a single-character of a script that’s confusable in other contexts.