ENS Name Normalization

Unsure. The JS code is incredibly easy to audit or re-implement yourself. The ref-impl and ens-normalize share the same normalization loop that directly follows the ENSIP processing section. I know there’s a Go implementation. I’m not sure how many other implementations exist. The smart contract route might also be best for user-facing applications.


After thinking about this, it might be better to enforce this through validation.


I computed some additional stats for the 1.4M names:

  • There are 1320 collisions (JSON)
    • 461 trivial (just casing permutation)
    • 669 pure arabic numerals
    • 160 non-trivial (everything else) which looks like hyphens and illegal emoji
  • Only 4.7% (66819 of 1410818 unique-valid names) are non-basic, where basic is /^[a-z0-9.-]+$/ + Valid Emoji (JSON)
    • 66808 if basic includes leading hyphen
    • 54927 if basic includes anywhere hyphen
  • Only 4.2% (60584) if I include single-character text-presentation emoji (non-colored) too
    • Even less if you include the pictographs (non-colored non-emoji)
  • The current smart contract incorrectly fails on 197 (0.014%), of which 196 are supposed to be valid. Once NFC is fully implemented, that will be 0 (100% match with reference implementation.)
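A minimal sketch of how the basic / non-basic split above could be computed. The regex is the one quoted in the list; the `valid_emoji` parameter and both function names are hypothetical stand-ins, since the real emoji list comes from the normalization data and is not reproduced here.

```python
import re

# Pattern quoted in the post: lowercase ASCII letters, digits, dot, hyphen.
# "*" is used so a name made only of valid emoji still counts as basic.
BASIC = re.compile(r"[a-z0-9.-]*")

def is_basic(name, valid_emoji=frozenset()):
    """True if the name is basic once the supplied emoji are ignored.

    valid_emoji is an assumption: a set of emoji sequences the caller
    already considers valid.
    """
    if not name:
        return False
    stripped = name
    # Remove longer sequences first so shorter ones cannot split them.
    for emoji in sorted(valid_emoji, key=len, reverse=True):
        stripped = stripped.replace(emoji, "")
    return BASIC.fullmatch(stripped) is not None

def non_basic_share(names, valid_emoji=frozenset()):
    """Fraction of names that are not basic."""
    names = list(names)
    non_basic = sum(1 for n in names if not is_basic(n, valid_emoji))
    return non_basic / len(names)
```

Run over the full list of unique valid names, `non_basic_share` would correspond to the 66819 / 1410818 figure, provided the same emoji set is used.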

They are both valid for normalization.

  • ฿ [E3F] is a Thai Common symbol.
  • ₿ [20BF] is a Common Symbol.

edit: oops, they’re both in Common. I think they both should normalize but it’s unclear if they’re both safe. The community will have to decide if these or others are confusable and result in an unsafe label. My general feeling is that currency symbols aren’t confusing.

For validation, a validator will determine if this is a confusable based on all the characters in the label, ignoring colored emoji.

The trivial Thai validator is just Thai only.
The smarter Thai validator is probably Thai + Latin + Common - Confusables.
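The two validators could be sketched roughly like this. Everything here is an assumption for illustration: the Thai ranges are my reading of the Unicode script data (note that ฿ [E3F] falls outside them because it is Common), the Latin and Common sets are deliberately tiny, and the confusable set is left to the caller because those decisions have not been made. Colored emoji, which a validator would ignore, are not handled.

```python
def in_thai_script(ch):
    """Approximate Script=Thai test; U+0E3F (baht sign) is Common, so excluded."""
    cp = ord(ch)
    return 0x0E01 <= cp <= 0x0E3A or 0x0E40 <= cp <= 0x0E5B

# Illustrative subsets only, not the full Latin or Common scripts.
LATIN = frozenset("abcdefghijklmnopqrstuvwxyz")
COMMON = frozenset("0123456789-") | {"\u0E3F", "\u20BF"}  # digits, hyphen, baht, bitcoin

def thai_only(label):
    """Trivial validator: every character must be Thai script."""
    return label != "" and all(in_thai_script(ch) for ch in label)

def thai_smart(label, confusables=frozenset()):
    """Thai + Latin + Common, minus whatever the community rules confusable."""
    if label == "":
        return False
    for ch in label:
        if ch in confusables:
            return False
        if not (in_thai_script(ch) or ch in LATIN or ch in COMMON):
            return False
    return True
```

With an empty confusable set, a label like ฿100 passes the smarter validator but not the trivial one; adding ฿ to `confusables` makes it fail both.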

The official confusables restricted to those charsets for Thai looks like this, where the non-green cells require a decision.


:joy: :joy: :joy:

Confusables aren’t addressed by my normalization proposal. I encountered too many edge cases while trying to develop a complete confusable-free solution.

My suggestion is that we have hard errors for names (normalization) that use illegal constructions (disallowed characters, illegal emoji, invisible characters, etc.) and soft errors for names that are unsafe/confusable (validation).
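As a rough illustration of that split (not the ENSIP algorithm): normalization raises on illegal input, while validation only returns warnings and never blocks resolution. The names `NormalizationError`, `normalize`, `validate`, the small `INVISIBLE` set, and the use of `str.lower()` plus NFC as a stand-in for the real mapping step are all assumptions made for this sketch.

```python
import unicodedata

class NormalizationError(ValueError):
    """Hard error: the name uses an illegal construction."""

# Illustrative subset of invisible characters; the real list is much longer.
INVISIBLE = {"\u200b", "\u2060", "\ufeff"}

def normalize(name):
    """Hard-error stage: reject illegal input, otherwise return a canonical form."""
    for ch in name:
        if ch in INVISIBLE or ch.isspace():
            raise NormalizationError(f"disallowed character U+{ord(ch):04X}")
    # Stand-in for the spec's mapping step: lowercase, then NFC.
    return unicodedata.normalize("NFC", name.lower())

# Starting universe of safe characters: alphanumeric ASCII plus the label
# separator. Colored emoji would also belong here but need emoji data.
SAFE = frozenset("abcdefghijklmnopqrstuvwxyz0123456789.")

def validate(normalized):
    """Soft-error stage: list warnings; an empty list means 'known safe'."""
    return [
        f"unreviewed character U+{ord(ch):04X}"
        for ch in normalized
        if ch not in SAFE
    ]
```

A name that normalizes but produces warnings would still resolve; a client could choose to show it with a caution marker. Growing `SAFE` over time is what expands coverage.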

This allows the normalization spec to be standardized while “unsafe” names still work. We can expand the universe of safe names until nearly all reasonable names are covered. We can start with alphanumeric ASCII + colored emoji, which I claim are safe. If we follow the distribution of registered names, it should be easy to hit 99%+ coverage.


There is a large set of single characters (~2K) that consists of default text-presentation emoji (e.g., those that appear uncolored) and non-emoji pictographs (☏️). There’s probably a subset of these that is safe to use in any name, like colored emoji. However, some of these aren’t unique and require a decision: ❤ [2764] vs ♥ [2665]. On Mac, it appears that some of these already buck the Unicode convention and appear colored, whereas ↖ [2196] vs ↖️ [2196 FE0F] does not. Determining which of these are safe covers another 1% of names.


Is there any chance I’ll be getting a refund for these two ENS names that I haven’t been able to sell, renew, or do anything with? :trinidad_tobago:🇹.eth and 🇮🇮🇮.eth, both done in all emojis. I think it has to do with the first two T’s turning into the flag.

I still think you are making a mistake with the hyphen and underscore. I think they are confusables, and that is the reason the underscore was stopped in web2. But we have minted out the ‘-#-‘ & ‘#’ names etc., so we have either just wasted some money or now have some very rare single-digit names for the price of a 3-character name.

I’m not going to keep going on about it, as I know my opinion doesn’t count, but I feel it will come back to bite ENS in the ass in the future.

Edit:

Underscores still not working, even in quotation marks


I agree with you that symbols that are normally little used will not add value to the ENS ecosystem, symbols like: [] >/*+" etc. But hyphens and underscores are among the most used characters in general, for example in usernames, and clearly do add value to ENS domains and the ecosystem. Why are people saying that they are confusable? I mean, everyone can see that this _ is an underscore and this - is a hyphen. This will 100 percent help with the adoption of ENS. I think adoption will be one of the main challenges now and in the future, and by allowing hyphens and underscores we just lower the threshold for newcomers.


People keep falling for approval scams in fake NFT airdrops still. Catering to the lowest common denominator isn’t wise.

This is complicated, and it’s going to remain complicated. Web 2.0 just used ASCII. That’s a nice simple solution for 1984. They did great, and Unicode didn’t even get invented until some years later. What would Web 2.0 be now if Unicode came before domain names?

It took nearly 40 years to stop discriminating against non-English speaking people on the internet with DNS. The ENS is governed in English. I think we should be cognizant of the fact we might actually wield great power over the future of internet usage. Anglo-centrism has made the world a smaller place in some ways, but it is silly to continue on with many conventions that limit us as humans. Confusion is just going to be part of life now. We will just have to deal with getting more knowledgeable about how society works.

The people falling victim to elementary scams is not a nice thing to see, but we don’t restrict who you can dial in the telephone system just because elderly people get targeted by scams of confusion, manipulation, and misrepresentation.


Web2 didn’t just use ASCII

There are emojis in web2, even emoji .com domains

Hi peeps, just joined the convo here.

Just looked into your domain names, and it seems that you just want to cut off the competition to save your own bag (domain names), since you own lots of names/numbers with hyphens. I own zero hyphen and zero underscore domains, but my usernames on different platforms include underscores and hyphens, and I see no way of confusing them. Maybe in the future I will create some, but I’m good for now with my 999club domains.

I’ve hedged either way, so whatever way the cookie crumbles it’ll be fine for me

Please guys, none of this. It’s ad hominem because you’re not arguing anything on technical merit, instead you’re seeking to discredit someone purely based on the domains they hold.


Cheers, I wasn’t going to say anything, I am used to it from them, also a cheap way to pump their club at the end

I actually laugh at things like this now, shows they are worried about their bags

Since 2003, but it doesn’t mean registrars allowed it or that absolutely anything was consistent. For 10 years or so there have allegedly been IDNs, but nobody uses them. I would be hesitant to say it was bungled, because when something is standardized it is better to keep it that way, restricting them to specific country-based TLDs.

Please guys, none of this. It’s ad hominem because you’re not arguing anything on technical merit, instead you’re seeking to discredit someone purely based on the domains they hold.

This is mostly a political forum, and that’s not an ad hominem as I see it. It’s perfectly fair to argue politically. In fact the main arguments for and against are not technical. There’s no technical limitation really, it’s all philosophical which comes down to politics. The only technical things involved are implementation, and compatibility issues.

We all want to stop scams, and while I disagreed with the Theth.eth’s - vs _ confusion argument, I definitely see it as valid. The effort you go to restrict domains to allow or disallow certain behaviors because of underlying empirical observations is pure politics. Motives being transparent or questioned is a key tenet of democracy. There’s also nothing wrong with having a motive of self-interest.

I’m arguing for a broader picture as many have made arguments, which in my opinion, show a bit too narrow of a worldview. Not everyone is a gamer or gets 90s culture in IRC/AOL/ICQ handles. Not everyone codes and even people who do don’t know every language’s conventions. There’s a few sticky places where inclusivity touches security concerns.

Not touched in the discussion was the role of +, for example. Only last night, while researching ways to index Web3 sites, did I discover some things about - versus _ and + in some libraries. It would be nice if at least subdomains could have + as a complement to -. Sure, you can access the text records of a domain, but as ENS domains will likely be subdomain heavy, it would be nice to have conventions that allow faster indexing with keyworded subdomains.

That is the same level of low-caliber ad hominem I warned him about. Don’t do that either please, come on.

Saying “you just want to save your own bags” is nothing but a personal attack. You are pretending to know the heart of the other person, and attacking them based on your own assumptions. Feel free to argue politically or technically, but keep personal attacks out of it.

The rest is fine, let’s just stay on topic and civil please

@raffy What about domains like this: ENS Resolver


I don’t know why they render vertically, but is there a possibility to add some normalization for them?

Combining marks stack/grow in different directions. Some build towers, some appear at the same spot, etc. Some marks obscure the underlying character. Some marks are very tiny.

  • i [69] vs ı̇ [131 307] (dotless i + dot)

Marks of the same class don’t reorder and form distinct names:

  • 1̉̇ [31 309 307] vs 1̇̉ [31 307 309]

If you convert every valid registered name to NFD and then count sequences of adjacent combining marks, only 41K names have 1 mark, 400 names have 2, and 400 have 3+.
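A sketch of how that count could be reproduced with Python’s standard `unicodedata` module. The function names are mine, and the definition of “mark” here is general category M*, which is an assumption about how the original stats were computed; note that it also counts variation selectors such as FE0F.

```python
import unicodedata
from collections import Counter

def is_mark(ch):
    """Combining mark = general category Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")

def longest_mark_run(name):
    """Length of the longest run of adjacent combining marks after NFD."""
    longest = current = 0
    for ch in unicodedata.normalize("NFD", name):
        if is_mark(ch):
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest

def mark_run_histogram(names):
    """Bucket names by their longest run: '1', '2', or '3+' marks."""
    buckets = Counter()
    for name in names:
        run = longest_mark_run(name)
        if run:
            buckets["3+" if run >= 3 else str(run)] += 1
    return buckets

# The pairs from the bullets above stay distinct under normalization:
# dotless i + dot does not compose to "i", and two marks of the same
# combining class (both are class 230) are not reordered.
nfc = lambda s: unicodedata.normalize("NFC", s)
assert nfc("i") != nfc("\u0131\u0307")
assert nfc("1\u0309\u0307") != nfc("1\u0307\u0309")
```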

UAX 39 suggests disallowing duplicate marks and only allowing 4 per character. I mentioned this issue a few times in this thread but ultimately I think enforcement should happen at validation level, instead of encoding this logic into the normalization spec.

For Latin, probably 1 CM is sufficient (and many shouldn’t validate, eg. a̲ [332] COMBINING LOW LINE), unless we want some very complicated logic that rejects duplicates and non-stacking/overlapping variations and blacklists some combinations. I’m not exactly sure what marks are needed in other scripts but we should try to keep it as minimal as possible.

IDNA says labels shouldn’t start with a CM. Additionally, I think a CM following an emoji is invalid too.
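At validation level, the mark rules discussed above might look something like this sketch. `mark_problems` and its `max_marks` parameter are hypothetical; the default of 4 follows the UAX 39 suggestion mentioned earlier, and a Latin validator could pass 1. The “no CM after an emoji” rule is left out because it needs emoji data, and because FE0F is itself category Mn and would need special handling.

```python
import unicodedata

def is_mark(ch):
    """Combining mark = general category Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")

def mark_problems(label, max_marks=4):
    """Return a list of soft-error messages about combining marks in a label."""
    problems = []
    run = []  # marks attached to the current base character
    for i, ch in enumerate(unicodedata.normalize("NFD", label)):
        if not is_mark(ch):
            run = []
            continue
        if i == 0:
            problems.append("label starts with a combining mark")
        if ch in run:
            problems.append(f"duplicate mark U+{ord(ch):04X}")
        run.append(ch)
        # Report once, at the first mark that exceeds the limit.
        if len(run) == max_marks + 1:
            problems.append(f"more than {max_marks} mark(s) on one character")
    return problems
```

Returning messages instead of raising keeps this on the validation side: the name still normalizes, and the client decides how to present the warning.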


I found a small issue: there are 8 hyphen-like characters that currently get mapped to other hyphen-like characters, which we in turn mapped to hyphen. There are 17 names that use these characters. These need to be mapped to hyphen too:

  • ‐ [2010]
  • ‒ [2012]
  • ― [2015]
  • ⁻ [207B]
  • ₋ [208B]
  • ︱[FE31]
  • ︲[FE32]
  • ﹘[FE58]

Edit: This should be fixed in the ENSIP, reference implementation, and ens-normalize.js library.
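For anyone re-implementing this, the extra mappings amount to a small lookup table. A minimal sketch, using only the eight code points listed above; `map_hyphen_likes` is a hypothetical helper, not a function from ens-normalize.js, and the real libraries fold this into their general mapping data.

```python
# The eight hyphen-like code points from the list above.
HYPHEN_LIKE = (0x2010, 0x2012, 0x2015, 0x207B, 0x208B, 0xFE31, 0xFE32, 0xFE58)

# str.translate accepts a dict keyed by code point.
HYPHEN_TABLE = {cp: "-" for cp in HYPHEN_LIKE}

def map_hyphen_likes(label):
    """Replace each listed hyphen-like character with ASCII hyphen (U+002D)."""
    return label.translate(HYPHEN_TABLE)
```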


This proves my point yet again

The underscore should map to the hyphen

They stopped using the underscore in Web2 as it was a confusable for the hyphen

Here we have vertical lines being mapped to the hyphen as they are confusables

Which looks more like a hyphen?? A vertical line or an underscore ?? I know which one I think does

People think it’s about me protecting my bags, but it’s not, I’m just stating facts……


A hyphen is a hyphen, an underscore is an underscore. Your thesis that these are confusable doesn’t hold much water.

Repeating the same thing continuously doesn’t change the fact that there is a plethora of reasons why underscore is a good idea for ENS dynamism, reasons that far outweigh your concerns imo.

At this point, whether or not you are protecting your bags is irrelevant. If you are looking out for the ENS ecosystem, that’s great, but the reasons not to allow underscore (at least those that I’ve read here) are objectively not compelling enough (and this seems to be what the majority of people who have voiced their opinion here believe too).

The work being done here protects against most (if not all) of your concerns.
