On-chain ENS Domain Normalization

Way back in December, I made the following comment in our now-infamous ENS Name Normalization thread:

After taking some time to look into it, I believe that an onchain implementation of “ENS Name Normalization” (as proposed by @raffy in Draft ENSIP - Standardization of ENS Name Normalization) is both technically feasible and economical within a mainnet contract.

As a small proof of concept, I built github.com/royalfork/ens-ascii-normalizer, which you can try out on etherscan. ENSNormalizeAscii contains the following functions:

  • namehash(string domain) returns (string, bytes32) returns the normalized domain, and the namehash node of the normalized domain (reverting if domain is invalid).
  • owner(string domain) returns (address) uses namehash to return the domain’s current ENS owner.

Note: This contract only supports ASCII-only domains. If a domain contains non-ASCII characters, namehash/lookup will revert.
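To make the ASCII-only behaviour concrete, here is a minimal Python sketch of what such a normalizer does off-chain. The allowed character set and the function name `normalize_ascii` are assumptions for illustration; the actual rules live in the contract source, which may differ. Raising `ValueError` stands in for an on-chain revert.

```python
# Hypothetical allowed set; the deployed contract's table may differ.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz0123456789-_")


def normalize_ascii(domain: str) -> str:
    """Lowercase A-Z and reject anything outside the allowed ASCII set."""
    labels = []
    for label in domain.split("."):
        if not label:
            raise ValueError("empty label")
        out = []
        for ch in label:
            if "A" <= ch <= "Z":
                ch = ch.lower()  # the only 1 -> 1 mapping in this sketch
            if ch not in ALLOWED:
                raise ValueError(f"invalid character: {ch!r}")
            out.append(ch)
        labels.append("".join(out))
    return ".".join(labels)
```

The namehash step is left out here because it needs keccak256, which is not in the Python standard library.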

Moving to full unicode support:

  • utf8 encoding/decoding incurs additional complexity and computing cost over just ASCII domains. In preliminary tests, validation of a simple 6-letter unicode domain cost about 200,000 gas (~$5), which is probably too expensive to be called by other contracts (although this number is based on unoptimized solidity code, and could be reduced).
  • IDNA2008 has ~1 million “unicode” rules which would all need to be supported. These rules can be compressed (run length encoding) into a set of ~2000 “rule import” eth transactions, each costing ~1,500,000 gas. This puts the total cost of IDNA2008 rule importation at around $85,000 (at 30 gwei gas).
  • Emoji validation requires importation of another ~3000 rules (based on saving the full RGI set on-chain, but it might be possible to do this with less). Not sure on the best data structures for this part yet, but would estimate that emoji validation data import would cost another $20,000-$50,000 in gas (I can try and firm up a better estimate if needed).
  • Most UTS-46 mapping would be encoded into the IDNA2008 rule set, but “1 char → n char” mappings would require separate importation. There are only a few of these, but this could add another ~$10,000 to deployment costs (also, implementing unicode 1->n mappings isn’t as simple as the ascii 1->1 mappings, some inline assembly is required to make those mappings memory efficient).
  • Once on-chain, the rules would be customizable, so an “owner” could allow/disallow certain characters and/or change how context-based validation works (emoji validation is implemented via context-based validation).
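As a rough illustration of the run length encoding idea mentioned above, here is a Python sketch that collapses a per-codepoint status table into runs and looks a codepoint up again. The status labels and both function names are made up for this example; the real rule encoding would be a packed binary format, not Python tuples.

```python
def rle_encode(statuses):
    """Collapse a per-codepoint status list into (status, run_length) pairs."""
    runs = []
    for status in statuses:
        if runs and runs[-1][0] == status:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([status, 1])  # start a new run
    return [(status, length) for status, length in runs]


def rle_lookup(runs, cp):
    """Return the status of codepoint cp by walking the runs in order."""
    start = 0
    for status, length in runs:
        if cp < start + length:
            return status
        start += length
    raise IndexError("codepoint beyond encoded table")
```

Because neighbouring codepoints usually share a status, a table with about a million entries shrinks to far fewer runs, which is what makes importing it in a few thousand transactions plausible.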

Not sure if ENSAsciiNormalizer is the first mainnet “string → owner” contract for ENS, but I find it nice to perform direct on-chain domain lookup/validation without the need for intermediary client-side code (it’s also nice to check for ZWJ characters directly on etherscan). If there’s any interest in a “full unicode normalization” contract (and we can collectively stomach the ~$200,000 in fees it would take us to get there), I can put together a more detailed RFP. Would be pretty cool to implement something like this, and it could vastly simplify 3rd party integrations (instead of asking etherscan, metamask, opensea, ethers.js, everyone else to “update their code” every time ENS normalization rules are updated, we can just point them to our easy-to-use on-chain normalization function and be done with it).


Onchain normalisation would be really useful to have; it would mean that implementations can start by using it, and transition to doing it in the client library for efficiency later, if desired.

Gas efficiency for normalisation needn’t be the top concern; normally there’d be no reason to do the normalisation inside a transaction - instead, clients can call the normalisation function to get the namehash, and pass that in to the transaction.

The costs to upload the tables will be substantial, but I think that with the preprocessing and compression work @raffy has done, it’s probably viable for a much smaller sum.


I extended this into a proof-of-concept for emoji+ascii normalization, and deployed on ropsten here: https://ropsten.etherscan.io/address/0x31096216008a6bc55ba6434488d14c86bdcf4283#code.

This contract maintains a list of all canonical/allowed emoji sequences, and only allows emoji within that set. ZWJ characters are only allowed within emoji sequences, and are only allowed in those emoji sequences that specifically use them, where they must match the allowed sequence verbatim. There’s also some special handling for 0xFE0F such that only fully-qualified emoji are returned as “normalized”, but the minimally-qualified, without 0xFE0F, is used to compute the namehash/labelhash (meaning this is fully backwards-compatible with existing emoji registrations).
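A small Python sketch of the allowlist and 0xFE0F handling described above, using the two emoji from this deployment (keycap seven and policewoman with light skin tone). The names `normalize_emoji`, `ALLOWED`, and `LOOKUP` are hypothetical, and a dictionary stands in for the contract's nested maps; this is only meant to show the "display form vs. hash form" split.

```python
FE0F = 0xFE0F

# Fully-qualified sequences in the allowlist (codepoints).
ALLOWED = [
    (0x37, 0xFE0F, 0x20E3),                      # keycap seven
    (0x1F46E, 0x1F3FB, 0x200D, 0x2640, 0xFE0F),  # policewoman, light skin tone
]


def strip_fe0f(seq):
    return tuple(cp for cp in seq if cp != FE0F)


# Accept either the fully-qualified form or the FE0F-less form, verbatim.
LOOKUP = {}
for full in ALLOWED:
    LOOKUP[full] = full
    LOOKUP[strip_fe0f(full)] = full


def normalize_emoji(seq):
    """Return (normalized display form, form used for hashing)."""
    full = LOOKUP.get(tuple(seq))
    if full is None:
        raise ValueError("emoji sequence not in allowlist")
    return full, strip_fe0f(full)
```

Since the lookup is verbatim, a ZWJ (0x200D) that is not part of an allowed sequence can never match, which mirrors the rule that ZWJ is only accepted inside the emoji that use it.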

This PoC implementation wastes a lot of storage (with expensive “nested maps”), but even in its unoptimized state, saving a large, 5-char-long emoji sequence into the allowlist costs 140k gas (https://ropsten.etherscan.io/tx/0x9ecf9c9665f415e5c97511489efa5185ab3d82ec0c62b8f7a88fa34023462195), which is about $13 on mainnet (30 gwei). Saving 3000 RGI emoji would cost $40k in the worst case, but most likely a lot less (@raffy has some ideas on how this can be substantially improved).

Note: For this deployment, I only added :seven: and :policewoman:t2: to the emoji allowlist…all other emoji will revert if you try and namehash, because they aren’t in the allowlist.


@royalfork - Awesome work. Thanks for taking the time to share the concept and PoC it.

@slobo.eth - Could we get this on the agenda for this week’s Ecosystem WG meeting? It’d be great to discuss this synchronously.
We’re back to 7pm GMT on Monday this week, correct?


Here is an initial version of an efficient emoji parser library that should be a drop-in for your contract.

I deployed a contract using it here and populated it.

It requires less than 1 eth in gas to deploy and supports all Unicode 14 emoji. (If the upload mechanism is exposed, it’s also upgradeable, but since it’s so cheap, it’s probably better to seal it and redeploy for future Unicode updates.)

  • function read(Emoji storage self, uint24[] memory cps, uint256 pos, bool skipFE0F) returns (uint256) — will return the length of the next emoji token, if it exists. If skipFE0F is true, it will effectively ignore FE0F (but it will also return a non-zero result for a run of FE0F's without any emoji.)

  • function filter(Emoji storage self, uint24[] memory cps, bool skipFE0F) returns (uint24[]) — will replace emoji tokens with a single 0xFFFFFF.
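For readers who want to play with the semantics off-chain, here is a loose Python analogue of read and filter built on a plain trie. It is a sketch, not a port: the contract uses packed state-transition rules in storage, the helper `build_trie` is invented for this example, and this version does not reproduce the quirk where a bare run of FE0F returns a non-zero length.

```python
FE0F = 0xFE0F
EMOJI_TOKEN = 0xFFFFFF


def build_trie(sequences):
    """Trie of emoji sequences stored without FE0F; None marks a complete emoji."""
    root = {}
    for seq in sequences:
        node = root
        for cp in seq:
            if cp != FE0F:
                node = node.setdefault(cp, {})
        node[None] = True
    return root


def read(trie, cps, pos, skip_fe0f):
    """Length of the longest emoji token starting at pos, or 0 if none."""
    node, best, i = trie, 0, pos
    while i < len(cps):
        cp = cps[i]
        if skip_fe0f and cp == FE0F and i > pos:
            i += 1  # consume FE0F without moving in the trie
        elif cp in node:
            node = node[cp]
            i += 1
        else:
            break
        if None in node:
            best = i - pos
    return best


def filter_emoji(trie, cps, skip_fe0f):
    """Replace every emoji token with a single 0xFFFFFF."""
    out, i = [], 0
    while i < len(cps):
        n = read(trie, cps, i, skip_fe0f)
        if n:
            out.append(EMOJI_TOKEN)
            i += n
        else:
            out.append(cps[i])
            i += 1
    return out
```

With `skip_fe0f` set to false, only the FE0F-less spelling of an emoji matches, so any FE0F in the input is left behind as an ordinary codepoint for a later validator to reject.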

If you use as a library, you’ll need to call upload() with the following 10KB payload, broken into (2) chunks:

Chunk 1

Chunk 2


The chunks are 6 byte rules for state transitions which follow this tree:


Here is a test contract in which I’ve set the above contract as the emoji parser and a BasicValidator contract that implements /^[a-z0-9_.-]+$/ as the validator.

  • filter(string name, bool ignoreFEOF) should parse any UTF8 string with emojis.

  • validate(string name) should validate any normalized name which uses valid emoji or obeys the regex above. Note: This will fail if emoji have FE0F (which isn’t normalized). This should validate 94% of registered names as of today.
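A minimal Python sketch of the validate step, working on codepoints that have already been through the emoji filter (so each emoji is a single 0xFFFFFF). It assumes the BasicValidator character class is meant to be lowercase letters, digits, hyphen, underscore, and dot; the function name `validate_filtered` is hypothetical.

```python
import re

EMOJI_TOKEN = 0xFFFFFF
# Assumed reading of the BasicValidator character class.
BASIC_CHAR = re.compile(r"[a-z0-9_.-]")


def validate_filtered(cps):
    """True if every codepoint is an emoji token or a basic ASCII character."""
    if not cps:
        return False
    return all(
        cp == EMOJI_TOKEN
        or (cp < 0x80 and BASIC_CHAR.fullmatch(chr(cp)) is not None)
        for cp in cps
    )
```

A leftover FE0F fails here because it is neither an emoji token nor in the basic set, which matches the note that names with FE0F will not validate.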

Outstanding! Do you have an estimate of what it would take to extend this to match your entire normalisation function, not just Emoji?

It depends on how IDNA and NFC are implemented. If you implement them verbatim, it will be very expensive to deploy like royalfork estimated. However, there’s probably a partial mapping that would give a lot of bang for the buck.

I think there is a possibility of using NFKD → Apply the small number of differences from IDNA 2003 → NFC, but I’d need to verify that this is the same thing as IDNA 2003 → NFD → NFC. You could avoid a whole lookup table with this approach.

For IDNA, we can pack 10 uint24 per storage slot for single char mappings. There are runs of “ABCabc” and “AaBbCc” for casefolding, and runs of “ABC”, “A?B?C?”, and “A??B??C??” that map to a constant. There’s a bunch of ideas in my github.
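The "10 uint24 per storage slot" figure follows from 256 // 24 = 10, which uses 240 bits and leaves 16 spare. A Python sketch of the packing, with Python's arbitrary-precision int standing in for a 256-bit word (the function names and the low-bits-first layout are assumptions for illustration):

```python
SLOT_BITS = 256
CP_BITS = 24
PER_SLOT = SLOT_BITS // CP_BITS  # 10 codepoints, 16 bits left over
MASK24 = (1 << CP_BITS) - 1


def pack_slot(cps):
    """Pack up to 10 codepoints into one 256-bit word, lowest index first."""
    if len(cps) > PER_SLOT:
        raise ValueError("too many codepoints for one slot")
    word = 0
    for i, cp in enumerate(cps):
        if not 0 <= cp <= MASK24:
            raise ValueError("codepoint does not fit in 24 bits")
        word |= cp << (CP_BITS * i)
    return word


def unpack(word, i):
    """Read the i-th 24-bit codepoint back out of a packed word."""
    return (word >> (CP_BITS * i)) & MASK24
```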

If, for the moment, we reject combining marks and composite characters (so NFC isn’t needed) then my emoji parser + an efficient IDNA mapping would be a string → string implementation. All the effort would go into efficiently compressing the mapping.

Otherwise, I think the most important contract is an NFD/NFC implementation.


For validation:

You can naively encode the valid output character set into a bitmap for 700 storage slots. Combine that with the emoji contract above + NFC quick check and that’s a gas efficient function for determining if a name is in the normalized form. I can demo this immediately once I have an NFC implementation.
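Here is a Python sketch of the naive bitmap: one bit per codepoint, 256 bits per slot, so 700 slots cover 700 × 256 = 179,200 codepoints. A dict keyed by slot index stands in for contract storage, and the function names are made up for this example.

```python
def build_bitmap(valid_cps):
    """Map slot index -> 256-bit word with one bit set per valid codepoint."""
    slots = {}
    for cp in valid_cps:
        slot, bit = cp >> 8, cp & 0xFF
        slots[slot] = slots.get(slot, 0) | (1 << bit)
    return slots


def is_valid(slots, cp):
    """One slot read plus a shift and mask, like a single SLOAD on-chain."""
    word = slots.get(cp >> 8, 0)
    return ((word >> (cp & 0xFF)) & 1) == 1
```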

I also think a few separate simple validators for each script would give a lot of coverage (and more could be added over time). However, this approach requires a mechanism for looping through the set of validators to find one which returns true for your name.

I mentioned in the other thread, you could store the address of validator on-chain for the name you verified, making all future validations only two SLOADs, node → validator address → is valid validator address, but this requires extra infrastructure.
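To sketch the caching idea in Python: loop through the validators once, remember which one accepted the name, and answer later checks with two lookups. Everything here (the class `ValidatorRegistry`, string ids in place of addresses, an opaque string for the node) is a hypothetical stand-in for the on-chain infrastructure, which does not exist yet.

```python
from typing import Callable, Dict


class ValidatorRegistry:
    def __init__(self, validators: Dict[str, Callable[[str], bool]]):
        self.validators = dict(validators)   # approved validators by id
        self.verified: Dict[str, str] = {}   # node -> id of validator that accepted it

    def validate(self, node: str, name: str) -> bool:
        vid = self.verified.get(node)                    # lookup 1: node -> validator
        if vid is not None and vid in self.validators:   # lookup 2: still approved?
            return True
        for vid, check in self.validators.items():       # slow path: try each validator
            if check(name):
                self.verified[node] = vid
                return True
        return False

    def revoke(self, vid: str) -> None:
        """Removing a validator invalidates every cached result that relied on it."""
        self.validators.pop(vid, None)
```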


This seems to be a reasonable starting point. Are there existing domains that use combining marks or composite characters?


5K names have combined characters (NFC != NFD.)
171 names have two combining marks that could be reordered.

Codepoints: 2.4K CM and 13K composed.

Maybe that is a good idea, just compress the IDNA part and disallow the codepoints that need NFC. That’s 99% of names unless I’m forgetting something.
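An off-chain way to spot the names that would be disallowed under this approach, sketched with Python's standard unicodedata module. The function name `needs_nfc` is made up, and the result depends on the Unicode version bundled with the Python build, so counts may not match the figures above exactly.

```python
import unicodedata


def needs_nfc(name: str) -> bool:
    """True if the name has a combining mark or a character that decomposes."""
    has_mark = any(unicodedata.category(ch).startswith("M") for ch in name)
    decomposes = unicodedata.normalize("NFD", name) != name
    return has_mark or decomposes
```

For example, a precomposed à (U+00E0) is flagged because NFD splits it into a plus U+0300, and the split form is flagged because U+0300 is a combining mark.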


@adraffy/ens-norm-research has an improved contract that only needs 360 storage slots and no longer requires the skipFE0F option, along with the code that generates the state payload.


Emoji + IDNA 2003 - (Combining Marks, Compositions) can be done with 10KB of contract code, 128 slots for the Valid bitmap, 452 slots for Mappings of [1-2 CP], 175 slots for Mappings of [3+ CP], and 360 slots for Emoji.

[128 + 452 + 175 + 360] → 1115 slots → ~30M + contract gas so under 1 eth to deploy.
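The arithmetic behind that estimate, as a Python sketch. It uses the 452 figure from the sum above and assumes 20,000 gas per freshly written storage slot (the base SSTORE cost for a zero to non-zero write); cold-access surcharges, calldata, and contract creation come on top, which is presumably where the rest of the ~30M goes. The 30 gwei gas price is the one used earlier in the thread.

```python
SLOTS = {
    "valid bitmap": 128,
    "mappings, 1-2 codepoints": 452,
    "mappings, 3+ codepoints": 175,
    "emoji": 360,
}
SSTORE_SET_GAS = 20_000  # assumption: base cost only, no surcharges
WEI_PER_GWEI = 10 ** 9
WEI_PER_ETH = 10 ** 18


def storage_gas(slots):
    """Lower bound on gas spent writing the tables into storage."""
    return sum(slots.values()) * SSTORE_SET_GAS


def gas_cost_wei(gas, gas_price_gwei):
    """Total fee in wei for a given amount of gas at a given gas price."""
    return gas * gas_price_gwei * WEI_PER_GWEI
```

`storage_gas(SLOTS)` gives 22.3M gas for the 1115 slots, and `gas_cost_wei(32_000_000, 30)` stays below `WEI_PER_ETH`, so a 32M gas deployment at 30 gwei does come in under 1 eth.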

122810 valid, 4208 mapped, 9 ignored, 3046 emoji sequences.

Since the IDNA table is stable, a huge chunk of the valid and single character mappings can be compiled directly to code.

I think worst-case per character would be ~15 comparisons (x < A), 10 range comparisons (x >= A && x <= B), 1 sload for valid, 1 sload for small mapping, 1 sload for big mapping, and then calling the emoji function (which is 1 sload per codepoint).


Edit: Here is my first working version: https://rinkeby.etherscan.io/address/0xdb68eb6f0ab93bc7059f2752d477b45f04457eb2#readContract

Cost 32M gas to deploy and populate.


Should work for every name except those with combining marks or characters that compose (like à).


Edit 2: I deployed a more gas efficient version that avoids using uint24[]. It also has a beautify() function.

I’m seeing about 50-100K gas per name.

Although I have little idea what you’re on about half the time here, your commitment to see this all the way through is admirable, @raffy :1st_place_medal:
