ENS Name Normalization 2nd

serenae · November 9, 2022, 2:33pm

I don’t know Hebrew. But I think both should be allowed, as long as the name is single-script. Same goes for the other examples in that list, like く or ノ or へ (which would block out a lot of legitimate words if disallowed).

The users of those languages around the world must already be intimately aware of any possible confusables, same as we in the English world are very aware of “I” vs “l” or “m” vs “rn” and so on.

As long as the name is contained to a single script then I think that’s fine.

accessor.eth · November 10, 2022, 9:40am

I agree with this.

May I suggest that names be disqualified from a refund if they:

have been traded
have been sold for a profit
have been used for scams, wallet drains, exploits, etc.
are attached to any address blacklisted by USDT, USDC, or any other asset due to misuse or unauthorized on-chain behavior
(I know that would probably be a tough list to accumulate.)

nick.eth · November 10, 2022, 8:37pm

In this case the refund would go to the current owner; the original owner has already been paid.

These would be extremely difficult to adjudicate.

Theth.eth · November 11, 2022, 2:17am

I’m guessing that they will also only get a refund of the minting price and not what they paid on the secondary market

nick.eth · November 11, 2022, 4:15am

That’s correct.

raffy · November 18, 2022, 8:04am

For now, I’ll have a global setting for any script with more than 1 undecided confusable, and I’ll default to allow. However, I do think many of these characters are confusable and should be reduced to one.

Previously, I ran into two problems while trying to finalize my latest changes in words. I had a problem sharing characters between multiple scripts, and I was reporting a confusable on names of the form AB where A was A-C confusable and B was B-D confusable but C-D isn’t confusable; essentially, it was confusing against a name that wasn’t possible. I also had the script logic set up in an inefficient way, which required all of the esoteric stuff to be tested first.

I discovered these issues while updating the ENSIP document and making some small adjustments based on some input I’ve been getting from other ENS users.

In fixing this issue, I think I greatly simplified the process.

First, there’s the description of the different types of names we want to allow. At the moment, these are script-centric (I’ve been calling them ScriptGroups), but they could be more general. Here are two examples:

{name: 'Latin', test: ['Latn'], rest: ['Zyyy', 'Zinh'], cm: -1, extra: explode_cp('π')}
{name: 'Japanese', test: ['Kana', 'Hira'], rest: ['Hani', 'Zyyy'],  cm: -1, romanize: true}

The way it’s defined is unrelated to what it actually is—it’s simply a set of characters and some rules for which characters can have combining marks.

{name: 'Latin', cps: [61, 62, ...], cm: ['â', ...]}
{name: 'Japanese', cps: [...], cm: [...]}

Second, there’s the IDNA-like part, where you take a string, parse out emoji, remove ignored characters from remaining sections, apply mappings, and then NFC those segments. This produces a simplified version of ens_tokenize() which only has 2 token types:

[ Emoji("💩"), Emoji("💩"), NFCString("a"), Emoji("💩"), NFCString("bc") ]
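As a rough illustration of that two-token output (not the actual ens_tokenize() code), here is a minimal sketch. It assumes a much looser emoji matcher than the real one (the built-in `\p{Extended_Pictographic}` regex property plus an optional FE0F), and it stands in for the "remove ignored, apply mappings" step with plain lowercasing. The name `simpleTokenize` and the token shape are made up for this example.

```javascript
// Assumption: any Extended_Pictographic code point (optionally followed by
// FE0F) counts as one emoji. The real library validates full emoji sequences.
const EMOJI = /\p{Extended_Pictographic}\uFE0F?/gu;

function simpleTokenize(label) {
  const tokens = [];
  let last = 0;
  const pushText = (s) => {
    if (!s) return;
    // Stand-in for "remove ignored characters + apply mappings": lowercase only.
    tokens.push({ type: 'nfc', text: s.toLowerCase().normalize('NFC') });
  };
  for (const m of label.matchAll(EMOJI)) {
    pushText(label.slice(last, m.index)); // text before this emoji
    tokens.push({ type: 'emoji', text: m[0] });
    last = m.index + m[0].length;
  }
  pushText(label.slice(last)); // trailing text, if any
  return tokens;
}

// Token types for "💩💩a💩bc": emoji, emoji, nfc, emoji, nfc
console.log(simpleTokenize('\u{1F4A9}\u{1F4A9}a\u{1F4A9}bc').map((t) => t.type));
```

Validity of the characters is deliberately not checked here, matching the point that it isn't needed at this stage.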

You actually don’t need to know if the characters are valid at this point. However, you can precompute a union from the characters in each ScriptGroup (and their decomposed parts) to fail early. You can also precompute non-overlapping sets, for which only a unique group can match. From the original group definitions, the characters can also be split into primary/secondary. For example, “a” is primary to Latin and “1” is secondary to Latin (since it’s primary to Common.)

This makes the matching logic very simple:

1. Given the list of simple tokens, just look at the textual ones.
2. For each character, see if there is a unique group.
   - If there is more than one unique group, error: you have an illegal mixture.
3. If you didn’t have a unique group, find all the groups that contain any primary character in the label.
   - If there are no groups, error: you have some weird name that matches no groups.
4. At this point, you have at least one group.
5. For each group, check if every character is valid according to that group and apply the combining mark rules. The first group that matches will work. If no group matches, throw the first error that occurred.

Note: there are strings that are made entirely of valid characters but match no groups. There are also strings that match multiple groups (which also may have different combining mark rules.)
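The matching steps can be sketched roughly like this. The two toy groups, their character sets, and the function names are assumptions made for illustration (real groups come from the derive output), and the combining mark rules are left out entirely.

```javascript
// Toy groups: illustrative data only, not the real ScriptGroups.
const DIGITS = '0123456789-';
const GROUPS = [
  { name: 'Latin', primary: new Set('abcdefghijklmnopqrstuvwxyz'), secondary: new Set(DIGITS) },
  // Greek lowercase alpha..omega (U+03B1..U+03C9)
  { name: 'Greek', primary: new Set(Array.from({ length: 25 }, (_, i) => String.fromCodePoint(0x3b1 + i))), secondary: new Set(DIGITS) },
];

// Characters that occur in exactly one group pin that group down.
function buildUnique(groups) {
  const count = new Map();
  for (const g of groups)
    for (const ch of [...g.primary, ...g.secondary]) count.set(ch, (count.get(ch) || 0) + 1);
  const unique = new Map();
  for (const g of groups)
    for (const ch of [...g.primary, ...g.secondary]) if (count.get(ch) === 1) unique.set(ch, g);
  return unique;
}

function determineGroup(text, groups = GROUPS) {
  const chars = [...text];
  const unique = buildUnique(groups);
  // Look for characters that belong to a single group.
  const pinned = new Set(chars.map((ch) => unique.get(ch)).filter(Boolean));
  if (pinned.size > 1) throw new Error('illegal mixture');
  // Otherwise, collect every group that owns a primary character of the label.
  let candidates = [...pinned];
  if (candidates.length === 0) {
    candidates = groups.filter((g) => chars.some((ch) => g.primary.has(ch)));
    if (candidates.length === 0) throw new Error('no matching group');
  }
  // The first candidate that accepts every character wins.
  let firstError;
  for (const g of candidates) {
    const bad = chars.find((ch) => !g.primary.has(ch) && !g.secondary.has(ch));
    if (bad === undefined) return g.name;
    if (!firstError) firstError = new Error(`disallowed character: ${bad}`);
  }
  throw firstError;
}

console.log(determineGroup('abc1')); // Latin
```

In this toy setup a digits-only label matches no group, because there is no Common group to own the digits as primary characters.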

With this setup, the order of the groups only matters for when you have a name that matches multiple groups (and the choice really doesn’t matter). For example, any digit that’s shared between multiple scripts, might fail on a restricted group but pass on normal group. I’ve set the order of the groups to match the distribution of registrations for efficiency. It’s also easy to just extract 1 group, like Latin, and ignore everything else, which would make a very tiny library.

For marks, I support a few modes: either you can explicitly whitelist compound sequences like, “e+◌́+◌̂” (which I’m doing for Latin/Greek/Cyrillic and a few others) or you can specify how many adjacent marks are allowed (when decomposed). The set of available marks is also adjustable per ScriptGroup (most marks are in the Zinh script because they’re shared between multiple scripts, but some script extensions give scripts all their own marks.) For example, I can bump Arabic to 2 and Common to 0, while leaving restricted groups at 1.
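For the "how many adjacent marks" mode, a minimal sketch could look like the following. It assumes that "mark" means any code point in Unicode general category M and that the limit is just a number per group; the actual library uses curated mark sets, so treat `maxAdjacentMarks` and `checkMarks` as hypothetical helpers.

```javascript
// Longest run of adjacent combining marks in the decomposed (NFD) form.
function maxAdjacentMarks(text) {
  let run = 0;
  let max = 0;
  for (const ch of text.normalize('NFD')) {
    run = /\p{M}/u.test(ch) ? run + 1 : 0; // a non-mark resets the run
    if (run > max) max = run;
  }
  return max;
}

// Throws when a label exceeds the limit configured for its group.
function checkMarks(text, limit) {
  if (maxAdjacentMarks(text) > limit) throw new Error('too many combining marks');
}

console.log(maxAdjacentMarks('\u00E9'));        // 1 (e + U+0301 after NFD)
console.log(maxAdjacentMarks('e\u0301\u0302')); // 2
```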

There are also 2 Unicode script adjustments. There are script extensions, which change a singular (Char → script) map to a (Char → Set<script>) map. For example, there are some characters used in exactly N scripts, but not in any others (unlike Zinh, which is allowed in all). There are also augmented scripts, which permit separate scripts to mix, like Kana+Hani or Hira+Latin. Lastly, there are edits we might want to make, like letting a restricted script access its currency symbol or allowing a scripted character (the ether symbol is Greek) to be used universally.

And lastly there are confusables, which are sets of different contiguous sequences that render visually indistinguishable. Unicode provides only a 1-character to N-character map, e.g. the single character "ɨ" looks like "i+◌̵" (which is 2). Confusables between different groups are one problem (whole-script confusables). Confusables within the same group are another (single-script confusables). Mixed-script confusables are replaced by the ScriptGroup logic, which allows you to construct a Japanese name with Kana+Hira+Hani+Zyyy+Latn+Emoji but not Latn+Grek.
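To make the 1-to-N example concrete: normalization alone does not unify the two spellings, since U+0268 has no canonical decomposition, so a separate table is needed. The one-entry table and the `skeleton` helper below are hypothetical and only show the shape of the data.

```javascript
const single = '\u0268';    // ɨ LATIN SMALL LETTER I WITH STROKE
const sequence = 'i\u0335'; // i + COMBINING SHORT STROKE OVERLAY

// NFC/NFD leave both forms untouched, so they stay different strings.
console.log(single.normalize('NFC') === sequence.normalize('NFC')); // false

// Hypothetical one-entry table in the 1-character to N-character shape.
const CONFUSABLES = new Map([[single, sequence]]);

// Replace each character by its confusable target so lookalikes compare equal.
function skeleton(text) {
  return [...text.normalize('NFD')]
    .map((ch) => (CONFUSABLES.has(ch) ? CONFUSABLES.get(ch) : ch))
    .join('');
}

console.log(skeleton(single) === skeleton(sequence)); // true
```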

This makes the process simple: just need to generate the correct ScriptGroups according to all these constraints, which allow the simple matching logic to work 100%.

I think I’ve solved the issue. I appreciate the patience.

raffy · November 26, 2022, 12:05am

Confusable Tool (alpha)

Two confusables that together are non-confusable

Common “0” + Cyrillic "x"

Yellow rows are the characters in the string. The cells represent the extent of a confusable: they can span groups and there can be multiple characters from the same group. The green header means all the characters in that string are in that group. A framed green header corresponds to the normalization decision. Red headers represent another group that can recreate the input string using at least one confusable swap.

Green cells were decided as non-confusable. Orange cells had no decision and are decided automatically (default allow). Blue cells were decided as confusable but allowed. Red cells were leftover characters in a decided confusable (they still could be valid if they’re the only character in their group.)

After IDNA, normalization, and ENS rules are applied, any cells with the same capital letter (within a confusable) are interchangeable. Bold cells are non-confusable. Gray cells are confusables due to the links between characters and groups. Cells without a letter are disallowed. White cells are unreachable.

Cyan rows are characters that are not confusable. Light-blue rows are characters that are in the same group as one of the highlighted headers.

raffy · November 26, 2022, 12:33am

I’ve made an attempt to organize the errors so they make sense. I try to fail as early as possible using left-to-right processing.

The distribution of errors is dependent on the sequence of the checks so this isn’t representative of the type of error, just the frequency that an end-user would see it.

The following is a tally from 2M names, taking the prefix of each error (before the first colon).

  11543 disallowed character
   7419 (not an error: different norm)
   1024 illegal mixture
    866 underscore allowed only at start
    349 whole-script confusable
    157 invalid label extension
    100 fenced
     32 too many combining marks
      1 emoji + combining mark

Fenced:
     43 trailing apostrophe
     38 leading apostrophe
     10 leading fraction slash
      4 trailing fraction slash
      2 leading middle dot
      2 adjacent fraction slash + fraction slash
      1 adjacent apostrophe + apostrophe
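The tally itself is simple to reproduce: take everything before the first colon of each error message and count. A small sketch, with `tallyByPrefix` as a made-up name and invented sample messages (not lines from the actual report):

```javascript
// Count error messages by the text before the first colon, most frequent first.
function tallyByPrefix(messages) {
  const counts = new Map();
  for (const msg of messages) {
    const i = msg.indexOf(':');
    const prefix = (i === -1 ? msg : msg.slice(0, i)).trim();
    counts.set(prefix, (counts.get(prefix) || 0) + 1);
  }
  return [...counts].sort((a, b) => b[1] - a[1]);
}

// Invented sample messages, for illustration only.
const sample = [
  'disallowed character: x',
  'disallowed character: y',
  'illegal mixture: Latin + Greek',
];
console.log(tallyByPrefix(sample));
// [ [ 'disallowed character', 2 ], [ 'illegal mixture', 1 ] ]
```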

Assuming I don’t encounter another problem, I should have updated error reports grouped by error type for easy review very soon.

raffy · November 26, 2022, 11:30pm

Preliminary error reports by type for easy inspection:

Non-trivial labels (not done yet)
Different norms (not done yet)
Disallowed Characters
Illegal Mixtures
Illegal Placement
Whole-script Confusables
Excess Combining Marks
tally.json (also includes: diff norms, underscore errors, and label-extension errors)

Comparison report

Might be useful: ENS Emoji Frequency

cazer.eth · November 27, 2022, 9:37pm

201A (‚) SINGLE LOW-9 QUOTATION MARK should be valid.

Allows for run on digit sequences such as 888888888 to be induced into their much more readable form; 888‚888‚888

Theth.eth · November 28, 2022, 12:50am

Shouldn’t happen as it is too close to . which would signify a subdomain

cazer.eth · November 28, 2022, 12:53am

To avoid that issue of collision with the full stop, we should enforce:

Can’t touch another 201A
Can’t start label or end label
Can’t touch an emoji
Must be in between digits

Mapping 002C to 201A is ideal.
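Purely to illustrate the proposed placement rule, here is a sketch. "Must be in between digits" already implies the other three conditions, so one check covers them all. The function name is made up, "digit" is assumed to mean ASCII 0-9, and the 002C mapping is not included.

```javascript
const LOW9 = '\u201A'; // SINGLE LOW-9 QUOTATION MARK

const isDigit = (ch) => ch !== undefined && /^[0-9]$/.test(ch);

// True when every U+201A in the label sits directly between two ASCII digits.
// That also rules out leading/trailing use, doubling, and touching an emoji.
function low9PlacementOk(label) {
  const cps = [...label];
  return cps.every((ch, i) => ch !== LOW9 || (isDigit(cps[i - 1]) && isDigit(cps[i + 1])));
}

console.log(low9PlacementOk('888\u201A888\u201A888')); // true
console.log(low9PlacementOk('\u201A888'));             // false
```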

raffy · November 29, 2022, 8:49am

I split the Different Norms report into categories based on the type of mapping required (Arabic, Hyphen, etc.). If a name required more than one technique, I put it in the “everything else” bucket. (For example, a name could have both an incorrect hyphen and apostrophe – fortunately, most only have a single issue.) I removed the trivial differences (like ASCII casing.)

Quick overview:

42% are Arabic digits
26% are improperly normalized emoji
12% are wrong hyphen
11% are wrong apostrophe
3% are circled/squared ASCII

I’m not particularly fond of the circle/double-circled/negative-circled mappings (I’ve mentioned this before) but I’m unsure what should be done. IDNA only maps the circled-letters (Ⓐ) and leaves the rest alone. I decided to map all of the remaining types except for the Negative Squared digits (which overlap with emoji.) Maybe they should just be disallowed instead? Two sets of negative circled digits is clearly wrong. Negative Circled and Squared letters are very similar.

The actual mapping rules can be found here: chars-mapped.js

cazer.eth · December 5, 2022, 1:15am

#4 in Disallowed Characters, 201A ( ‚ ) Single Low-9 Quotation Mark, should be allowed as Valid

In English, we use commas to organize numbers greater than 999. We use a comma every third digit from the right.

More than 50,000 people turned up to protest.

The comma every third digit is sometimes known as a “thousands-separator.” Make sure you don’t include a space on either side of this comma.

Correct:

We will walk 10,000 miles.

Incorrect:

We will walk 10000 miles.
We will walk 10, 000 miles.
We will walk 10 , 000 miles.
We will walk 10 ,000 miles.

Reference Source

Why 201A for ENS?

It would be the most feasible option to map 002C (the Keyboard Comma) into 201A
We would see a surplus growth in ENS because of the prime example of the 10K Club & 100K Club
The 10K/100K Clubs .eth domains, to me, are related to real estate property numbers and zip codes due to my prior property management experience. Respectfully, there could be a topic illustrating that 8,345.eth, creating a need for a more detailed explanation.
8,345.eth would bring an ideal digital identity .eth name in particular because of how the digit comma separator precisely organizes the thousands, millions, and so on.
8345.eth to me is the representative example of an address of a certain real estate property.
78000.eth is another example of a number sequence that has an established meaning in reality. It has more of a zip-code feel than a digital identity name.
78,000.eth has a more calm and collected style if I were to personify myself as that identity.
All in all, the addition of symbol 201A will be very beneficial to the ENS Ecosystem and the DAO.

1. This would bring in more registration revenue for ENS.
2. Allows for run-on sequences of digits to be easily readable and easier to send funds.
3. We could allow beginners wanting to get involved in ENS to have a good access point to become a part of the ENS ecosystem, with a good opportunity to do so. For example, we saw a huge spike in new users participating in the ENS ecosystem when the 10K Club/100K Club came into existence.
4. Allows for a better intrinsic experience when utilizing the digit comma domain as a digital identity.
5. A better overall experience for the user would be granted by adding 201A when digits are used as normalized domains.

serenae · December 5, 2022, 1:38am

Hmm, I generally disagree with allowing comma or comma-likes as I think there could be too many potential conflicts out there. Mapping is also something that should not be taken lightly, as you can’t unmap, not without making breaking changes.

Not every character used in everyday human speech makes sense to use in a global naming system like this. And registration revenue does not matter here, I don’t think that should factor in at all when it comes to creating a robust set of normalization rules.

Theth.eth · December 6, 2022, 12:03am

Totally agree with not allowing it

raffy · January 12, 2023, 11:02am

Once again, I apologize for the slow turnaround.

The spec has been pretty stable and I’m satisfied with how it functions. I’ve had time to review the breakdown reports and I think they look reasonable.

A few months ago, an earlier version of ens-normalize.js (lacking validation/confusables) was incorporated into ethers.js. Marketplaces like ens.vision already use ens-normalize.js. I’ve been monitoring daily registrations and have seen a pretty large reduction in invalid names.

I’ve received many DMs with questions and concerns and appreciate all the feedback.

I’ve updated my eth-ens-namehash branch with the latest ens-normalize.js logic. I decided to only include the minimal code necessary (rather than all of the derivation and validation stuff.) I updated the PR too.

Everything else can be found in my ens-normalize.js repo.

At the moment, my ENSIP document is still unfinished. I’ve been struggling to concisely describe everything. Possibly the formatting style I chose—nested (un)ordered lists instead of paragraphs or pseudo-code—is just too limiting. I just need to complete the section on how I resolve confusables and then I’ll publish it.

I don’t exactly know what the next steps are in this process but I think there’s enough code, tests, and reports to get the ball rolling.

Here are some notes to help pinpoint an area of disagreement:

I don’t think any of my emoji choices are controversial. AFAIK ens-normalize.js is the only normalization library which enforces correct emoji sequencing. I’m not aware of any exploits or spoofs beyond emoji that actually look similar.
I think my hyphen changes have the correct balance of reducing confusables but providing good UX to users.
I think _ and $ are great additions. I allowed all modern currency symbols. The correct Ethereum symbol is Greek Xi. The other triple-bar characters are confusable.
I disabled Braille, Sign Writing, Linear A, and Linear B scripts.
I disabled all Combining Characters, Box Drawings, Di/Tri/Tetra/Hex-Grams, Small Capitals, Turned Characters, Musical Symbols, and other esoteric characters.
I disabled many formatting characters like digraphs and ligatures.
I disabled nearly all punctuation and phonetic characters.
I heavily curated the allowed combining marks in Latin-like scripts. Not all exemplar recommendations are allowed. I used prior registrations and scripted dictionaries to decide the limit on the number of adjacent combining marks, however native users might have different opinions.
I disabled many obsolete, deprecated, and archaic characters.
For non-emoji symbols, my convention was always to choose the heavy variant if available, rather than have 5+ differently-weighted variants of the same symbol.
I merged upper- and lowercase confusables. For example, scripts with a Capital G-like character (with no lower-case equivalent) confuse with Latin g since Latin G is casefolded to g.
I’m aware there are multiple-character confusables but I’m not aware of a reasonably exhaustive list of known cases. For the ones I’m aware of, I don’t think the implementation complexity is worth it.
There are still unresolved confusables where I can’t decide which character is the preferred one. The default has been to allow them. You can see many of them by browsing the Han group using my confusable explainer.
There are features that ended up being dead code because they aren’t needed (but might be needed in the future.) Instead of including that code, it is commented out and there are checks during the derive process which fail if the code is required. For example, after all of the combining mark + confusable logic is applied, there aren’t any whitelisted multi-character combining mark sequences that don’t collapse (NFC) into a single character.
ADDED Non-IDNA Mappings: discussed above. I think my solution is more consistent but I think disallowing these characters is also valid.

Nearly all of the decisions above can be found in /derive/rules/ and all of the necessary data for implementation can be found in /derive/output. There also is a text log of the derive process which annotates all of the changes relative to IDNA.

Resolver demo and npm package are using the latest code.

rayw · January 19, 2023, 5:07pm

Hello,

I don’t mean to rush the process but I was wondering if we could get an ETA on normalization being finalized. With the subdomain wrapper releasing soon, I’m assuming the idea is to push normalization prior to it.

raffy · January 19, 2023, 7:51pm

A concern related to subdomains that can be addressed by normalization:

should "...a...eth" become "a.eth"?

I think null labels (0-length) should collapse. The null label is perfectly valid on a subdomain but seems kinda silly—I can’t think of a use-case.

serenae · January 19, 2023, 8:43pm

Maybe that should just throw an error as invalid instead.
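The two options can be sketched side by side. Both function names are hypothetical, and neither is claimed to be what ens-normalize.js does; they only show the behavioral difference on a name with empty labels.

```javascript
// Option A: drop every 0-length label, so "...a...eth" becomes "a.eth".
function collapseEmptyLabels(name) {
  return name.split('.').filter((label) => label.length > 0).join('.');
}

// Option B: treat any 0-length label as an error.
function rejectEmptyLabels(name) {
  if (name.split('.').some((label) => label.length === 0)) {
    throw new Error('empty label');
  }
  return name;
}

console.log(collapseEmptyLabels('...a...eth')); // a.eth
console.log(rejectEmptyLabels('a.name.eth'));   // a.name.eth
// rejectEmptyLabels('.name.eth') throws "empty label"
```

Option B surfaces the typo case instead of silently resolving a different name.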

Perhaps it’ll catch some fat-finger or copy/paste mistakes, like if someone wants a.name.eth but types in .name.eth. The current behavior on both the manager app and Metamask is to throw an error: