ENS Name Normalization 2nd

To avoid that issue of collision with the full-stop, we should enforce:

  • Can’t touch another 201A

  • Can’t start or end a label

  • Can’t touch an emoji

  • Must be between digits (see the sketch below)

Mapping 002C to 201A is ideal.
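Here is a rough sketch of how those placement rules could be checked once 002C has been mapped to 201A. It is only illustrative: the digit and emoji tests below are simplified stand-ins, not the normalizer's actual tokenizer.

```js
// Hypothetical U+201A placement check, applied per label after the 002C -> 201A mapping.
const SEP = '\u201A'; // single low-9 quotation mark

function checkSeparators(label) {
  const chars = [...label];
  chars.forEach((ch, i) => {
    if (ch !== SEP) return;
    const prev = chars[i - 1], next = chars[i + 1];
    if (prev === undefined || next === undefined) throw new Error('cannot start or end a label');
    if (prev === SEP || next === SEP) throw new Error('cannot touch another 201A');
    if (/\p{Emoji_Presentation}/u.test(prev) || /\p{Emoji_Presentation}/u.test(next))
      throw new Error('cannot touch an emoji');
    if (!/[0-9]/.test(prev) || !/[0-9]/.test(next)) throw new Error('must be between digits');
  });
}

checkSeparators('10\u201A000');    // ok
// checkSeparators('\u201A10000'); // throws: cannot start or end a label
```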

I split the Different Norms report into categories based on the type of mapping required (Arabic, Hyphen, etc.). If a name required more than one technique, I put it in the “everything else” bucket. (For example, a name could have both an incorrect hyphen and apostrophe – fortunately, most only have a single issue.) I removed the trivial differences (like ASCII casing).

Quick overview:

  • 42% are Arabic digits
  • 26% are improperly normalized emoji
  • 12% are wrong hyphen
  • 11% are wrong apostrophe
  • 3% are circled/squared ASCII

I’m not particularly fond of the circled/double-circled/negative-circled mappings (I’ve mentioned this before) but I’m unsure what should be done. IDNA only maps the circled letters (Ⓐ) and leaves the rest alone. I decided to map all of the remaining types except for the Negative Squared digits (which overlap with emoji). Maybe they should just be disallowed instead? Having two sets of negative circled digits is clearly wrong. Negative Circled and Negative Squared letters are very similar.

The actual mapping rules can be found here: chars-mapped.js





Regarding #4 in the list of Disallowed Characters: 201A ( ‚ ) SINGLE LOW-9 QUOTATION MARK should be allowed as Valid.

In English, we use commas to organize numbers greater than 999. We use a comma every third digit from the right.

  • More than 50,000 people turned up to protest.

The comma every third digit is sometimes known as a “thousands-separator.” Make sure you don’t include a space on either side of this comma.

Correct:

  • We will walk 10,000 miles.

Incorrect:

  • We will walk 10000 miles.
  • We will walk 10, 000 miles.
  • We will walk 10 , 000 miles.
  • We will walk 10 ,000 miles.

Reference Source
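As a quick sanity check, standard locale formatting produces exactly this grouping (a comma every third digit, no spaces); a one-line illustration assuming the en-US locale:

```js
// Built-in en-US number formatting uses the thousands-separator described above.
console.log((10000).toLocaleString('en-US')); // "10,000"
console.log((78000).toLocaleString('en-US')); // "78,000"
```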

Why 201A for ENS?

  • Mapping 002C (the keyboard comma) to 201A would be the most feasible option

  • We would see additional growth in ENS, following the prime example of the 10K Club & 100K Club

  • The 10K/100K Club .eth domains, to me, are related to real-estate property numbers and zip codes due to my prior property-management experience. Respectfully, there could be a whole topic illustrating a name like 8,345.eth, which creates a need for a more detailed explanation.

  • 8,345.eth would be an ideal digital-identity .eth name in particular because of how the digit comma separator precisely organizes the thousands, millions, and so on.
    8345.eth, to me, is the representative example of an address of a certain real-estate property.

  • 78000.eth is another example of a number sequence that has an established meaning in reality. It has more of a zip-code feel than a digital identity name.
    78,000.eth has a more calm and collected style if I were to personify myself as that identity.

  • All in all, the addition of symbol 201A will be very beneficial to the ENS Ecosystem and the DAO.

1. This would bring in more registration revenue for ENS.
2. Allows long runs of digits to be easily readable, making it easier to send funds.
3. It would give beginners a good access point into the ENS ecosystem and a good opportunity to become part of it. For example, we saw a huge spike in new users participating in the ENS ecosystem when the 10K Club/100K Club came into existence.
4. Allows for a better intrinsic experience when utilizing the digit comma domain as a digital identity.
5. A better overall experience for the user would be granted by adding 201A when digits are used as normalized domains.


Hmm, I generally disagree with allowing comma or comma-likes as I think there could be too many potential conflicts out there. Mapping is also something that should not be taken lightly, as you can’t unmap, not without making breaking changes.

Not every character used in everyday human speech makes sense to use in a global naming system like this. And registration revenue does not matter here, I don’t think that should factor in at all when it comes to creating a robust set of normalization rules.


Totally agree with not allowing it


Once again, I apologize for the slow turnaround.

The spec has been pretty stable and I’m satisfied with how it functions. I’ve had time to review the breakdown reports and I think they look reasonable.

A few months ago, an earlier version of ens-normalize.js (lacking validation/confusables) was incorporated into ethers.js. Marketplaces like ens.vision already use ens-normalize.js. I’ve been monitoring daily registrations and have seen a pretty large reduction in invalid names.

I’ve received many DMs with questions and concerns and appreciate all the feedback.


I’ve updated my eth-ens-namehash branch with the latest ens-normalize.js logic. I decided to only include the minimal code necessary (rather than all of the derivation and validation stuff.) I updated the PR too.

Everything else can be found in my ens-normalize.js repo.

At the moment, my ENSIP document is still unfinished. I’ve been struggling to concisely describe everything. Possibly the formatting style I chose—nested (un)ordered lists instead of paragraphs or pseudo-code—is just too limiting. I just need to complete the section on how I resolve confusables and then I’ll publish it.

I don’t exactly know what the next steps are in this process but I think there’s enough code, tests, and reports to get the ball rolling.


Here are some notes to help pinpoint an area of disagreement:

  • I don’t think any of my emoji choices are controversial. AFAIK ens-normalize.js is the only normalization library which enforces correct emoji sequencing. I’m not aware of any exploits or spoofs beyond emoji that actually look similar.

  • I think my hyphen changes have the correct balance of reducing confusables but providing good UX to users.

  • I think _ and $ are great additions. I allowed all modern currency symbols. The correct Ethereum symbol is Greek Xi. The other triple-bar characters are confusable.

  • I disabled Braille, Sign Writing, Linear A, and Linear B scripts.

  • I disabled all Combining Characters, Box Drawings, Di/Tri/Tetra/Hex-Grams, Small Capitals, Turned Characters, Musical Symbols, and other esoteric characters.

  • I disabled many formatting characters like digraphs and ligatures.

  • I disabled nearly all punctuation and phonetic characters.

  • I heavily curated the allowed combining marks in Latin-like scripts. Not all exemplar recommendations are allowed. I used prior registrations and scripted dictionaries to decide the limit on the number of adjacent combining marks, however native users might have different opinions.

  • I disabled many obsolete, deprecated, and archaic characters.

  • For non-emoji symbols, my convention was always to choose the heavy variant if available, rather than have 5+ differently-weighted variants of the same symbol.

  • I merged upper- and lowercase confusables. For example, scripts with a Capital G-like character (with no lower-case equivalent) confuse with Latin g since Latin G is casefolded to g.

  • I’m aware there are multiple-character confusables but I’m not aware of a reasonably exhaustive list of known cases. For the ones I’m aware of, I don’t think the implementation complexity is worth it.

  • There are still unresolved confusables where I can’t decide which character is the preferred one. The default has been to allow them. You can see many of them by browsing the Han group using my confusable explainer.

  • There are features that ended up being dead code because they aren’t needed (but might be needed in the future.) Instead of including that code, it is commented out and there are checks during the derive process which fail if the code is required. For example, after all of the combining mark + confusable logic is applied, there aren’t any whitelisted multi-character combining mark sequences that don’t collapse (NFC) into a single character.

  • ADDED Non-IDNA Mappings — discussed above. I think my solution is more consistent, but I think disallowing these characters is also valid.

Nearly all of the decisions above can be found in /derive/rules/ and all of the necessary data for implementation can be found in /derive/output. There also is a text log of the derive process which annotates all of the changes relative to IDNA.

Resolver demo and npm package are using the latest code.
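For anyone who wants to try it, here is a minimal usage sketch of the npm package (the exported names follow the repo; treat the exact error messages as illustrative):

```js
// Minimal usage sketch of ens-normalize.js.
import { ens_normalize, ens_beautify } from '@adraffy/ens-normalize';

console.log(ens_normalize('RaFFY.eTh')); // "raffy.eth" (case-folded, NFC'd, validated)
console.log(ens_beautify('raffy.eth'));  // display form (e.g. FE0F restored on emoji)

try {
  ens_normalize('ra ffy.eth');           // space is disallowed, so this throws
} catch (err) {
  console.log(err.message);
}
```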


Hello,

I don’t mean to rush the process but I was wondering if we could get an ETA on normalization being finalized. With the subdomain wrapper releasing soon, I’m assuming the idea is to push normalization prior to it.


A concern related to subdomains that can be addressed by normalization:

  • should "...a...eth" become "a.eth"?

I think null labels (0-length) should collapse. The null label is perfectly valid on a subdomain but seems kinda silly—I can’t think of a use-case.

Maybe that should just throw an error as invalid instead.

Perhaps it’ll catch some fat-finger or copy/paste mistakes, like if someone wants a.name.eth but types in .name.eth. The current behavior on both the manager app and Metamask is to throw an error:

[screenshots: the error shown in the manager app and MetaMask]


I know ethers explicitly disallows null labels and also disallows trailing stop.

UTS-46 VerifyDnsLength is technically false for ENS, since ENS permits arbitrary-length names. When false, it also allows a trailing stop (which is wrong, since the namehash root label is 0x0, not keccak("")).
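To make the trailing-stop point concrete, here is a minimal namehash sketch (assuming the js-sha3 package for keccak-256) showing why the root label is 32 zero bytes and why a trailing stop changes the hash:

```js
import { keccak_256 } from 'js-sha3';

// Minimal namehash: node = keccak256(node ++ keccak256(label)), labels right to left.
function namehash(name) {
  let node = new Uint8Array(32); // the empty name hashes to 32 zero bytes, NOT keccak("")
  if (name.length === 0) return node;
  for (const label of name.split('.').reverse()) {
    const labelHash = keccak_256.array(label);
    node = Uint8Array.from(keccak_256.array([...node, ...labelHash]));
  }
  return node;
}

// "a.eth." splits into ["a", "eth", ""], so the trailing empty label gets hashed into
// the node and namehash("a.eth.") !== namehash("a.eth"); a trailing stop must be
// rejected (or stripped), never passed through.
```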

I think it should either be:

  1. strict: require (1+)-length labels (which would deny leading/trailing/adjacent stops)
  2. polite: normalize away null labels: "...a....eth == a.eth"

Edit: I will change to strict and add example code for how to safely collapse null labels (which may have interleaved ignorables).
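In the meantime, a naive version of that collapse (which does not yet handle interleaved ignorables) is just:

```js
// Naive null-label collapse: "...a....eth" -> "a.eth".
// A real version would have to run after ignorable characters are removed, otherwise
// a label consisting only of ignorables would survive the filter.
function collapseNullLabels(name) {
  return name.split('.').filter(label => label.length > 0).join('.');
}

console.log(collapseNullLabels('...a....eth')); // "a.eth"
```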


Checksummed addresses solve the fatfinger/copy paste problem. I try to never copy paste unchecksummed addresses to a wallet. This includes ENS names. ENS names are great for simplifying addresses many times, but you need to be aware of what you are getting into.

Wallets currently have poor ENS support, as do most UXs. Relevant info like domain-to-address age, transaction count, and balance is important for the times when you do give up security for convenience. It’s ok to type in a name.eth first if you can remember the address, but you would be extra sure if you could see the domain age in the UX.

Null labels should be considered invalid.

Do you have an updated report on how many domains are affected by the new normalisation function?

Hi, I don’t want to mess with the algorithm, but I am surprised that all apostrophes map to ’ (2019) and not to ' (27). The second one is ASCII and is easy to type (just one button on my keyboard).

Various breakdown reports are available; it’s trivial to produce some other format if needed.

For IDNA, 27 (') APOSTROPHE is disallowed and 2019 (’) RIGHT SINGLE QUOTATION MARK is valid.


I had my first bug report in ens-normalize for a misnamed variable in the combining-mark counting code. Fortunately, it impacts no registered names, but it indicated a missing test case: a string with both decomposable characters and excess combining marks near the end of the string. It was found by Carbon225 while developing a Python normalization port.

Related: there are only a few names that fail due to excess combining marks (most true abuses fail earlier for different reasons, like illegal use or invalid mixtures). If anyone with relevant experience could comment on the validity of these remaining names, it would be greatly appreciated.

The notation used below is that the name matches the group Bengali but a character was found that is followed by 3 CM, where the maximum allowable was 2 (e.g. 3/2). I’m contemplating changing the CM limit to 3 for all non-CM-whitelisted groups. I’m currently using a value of 1 or 2 (see: cm:#). Edit: the Unicode recommendation is a max of 4 NSM.

[screenshot: list of flagged names annotated with group and CM counts]


I had a request to enable a different check character, likely due to the popularity of the Checks NFT. Upon further inspection, it does land in an unfortunate grey area.

2714 (✔︎ ✔️) HEAVY CHECK MARK vs. 2713 (✓) CHECK MARK

My convention for deciding amongst characters of different weights (very thin, thin, regular, medium, heavy, very heavy, etc.) was to choose the heavy variant if available. For checks, the heavy variant is also the emoji character.

Because normalized emoji have their FE0F stripped and check is default text-presentation (Emoji_Presentation=false), a normalized check emoji looks like a bold textual check. I made a simple demo which shows a few heavy variants with emoji forms and their corresponding most-similar textual character (“alt”). If you view this page on different browsers, operating systems, and devices (desktop/mobile), you’ll notice that it’s visually inconsistent.
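A tiny illustration of why that happens (code points only; the actual rendering depends on the platform):

```js
// U+2714 has Emoji_Presentation=false, so without the FE0F variation selector it
// defaults to text style; stripping FE0F during normalization therefore leaves the
// same code point as the "textual" heavy check mark.
const emojiCheck = '\u2714\uFE0F';                    // ✔️ (emoji presentation)
const normalized = emojiCheck.replace(/\uFE0F/g, ''); // FE0F stripped
console.log(normalized === '\u2714');                 // true: renders as ✔︎ on most platforms
```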

I understand the desire for a textual check but the unpredictability of emoji appearance makes ✓✔︎ too similar to enable in good conscience. For example, if ✓ was ✕, we’d be having the same discussion about xX✕✖︎✖️.


If there’s anything else I can help calculate or do to facilitate DAO adoption, please let me know.

The things that would immediately benefit from the new normalization spec:

  • app.ens.domains: registration input, showing script of labels, beautified name, etc…
  • metadata: showing beautified names in the image/svg and properly assigning ⚠️ in marketplaces
  • etherscan: beautified primary names
  • metamask: their ENS input needs a lot of work

Does this mean you’re ready to lock the spec in and submit it as a standard?


For my final changes, which address two prior concerns:

I’ve changed to NSM counting with a maximum of 4 unique characters for all non-CM-whitelisted script groups, like the Unicode security suggestion. This works much better than I expected. The Breakdown report has changed from cm → nsm.html and there are now only 4 exceptions.
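A rough sketch of that rule as I read it: after NFD, every run of non-spacing marks following a base character may contain at most 4 marks, all distinct (the CM-whitelisted group exception is assumed to be handled elsewhere):

```js
// Hypothetical NSM check for non-CM-whitelisted groups.
function checkNSM(label, max = 4) {
  let run = new Set();
  for (const ch of label.normalize('NFD')) {
    if (/\p{Mn}/u.test(ch)) {            // non-spacing mark
      if (run.has(ch)) throw new Error('duplicate non-spacing mark');
      run.add(ch);
      if (run.size > max) throw new Error('too many non-spacing marks');
    } else {
      run = new Set();                   // reset at the next base character
    }
  }
}
```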

I didn’t get much input on these characters. According to the Breakdown reports, usage is very low. To err on the safe side, I’ve decided to disallow these characters instead, so they may be revisited at a future date (whereas mapping would leave them permanently unusable).

I’m happy with version 1.9.0. I will update my ENSIP with these final changes ASAP.


Just to confirm, do you consider the current state of the ENSIP ready for last call and then finalization?


Yes. It has some URLs that currently link to my repository. The only critical link would be to spec.json.
