ENS Name Normalization 2nd

raffy · September 25, 2022, 9:19pm

Previous thread

Please provide constructive input/questions/comments/concerns and include examples/use-cases.

Resources

Unicode
Existing Specification
- @ensdomains/eth-ens-namehash.js
- EIP #137 / ENSIP #1
Proposed Specification
- ENSIP: @adraffy/ensip-norm
- @adraffy/ens-normalize.js
- Related
  - Testing: @adraffy/ens-ens-norm-tests
  - Solidity: @adraffy/ens-ens-norm-research
Applications
- Demo: Resolver – reflects latest changes
- Supported Emoji – query emoji information
- Demo: Punycode

Prior Discussion Highlights

TODO

raffy · September 26, 2022, 7:25am

Ether symbol, Greek Xi (Ξ➔ξ), should probably be changed from Greek to Common script so it’s available in all names. There should probably be a rule file which lets us re-script a character (most likely change would be X → Common.)
Does anyone have any opinions on Regional Indicators? Should we only allow valid regional flags (pairs of Regional Indicators)?
- Only 260 of 676 combinations would be valid
- All odd-length sequences would be invalid
- UTS-51: “A singleton Regional Indicator character is not a well-formed emoji flag sequence”
- Stats: 16357 names with regionals, 1016 pure regionals, only 87 invalid
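To make the pairing rule concrete, here is a minimal sketch of how runs of Regional Indicators could be checked. It is not the ens-normalize.js implementation: the function names are made up for illustration, and `VALID_FLAGS` is a hypothetical four-entry sample standing in for the real list of valid flags.

```python
# Regional Indicator Symbols occupy U+1F1E6 (letter A) through U+1F1FF (letter Z).
RI_FIRST, RI_LAST = 0x1F1E6, 0x1F1FF

# Hypothetical, deliberately tiny allow-list (assumption for this sketch).
# A real list would be derived from the Unicode emoji data files.
VALID_FLAGS = {"US", "JP", "DE", "FR"}


def is_regional(ch: str) -> bool:
    return RI_FIRST <= ord(ch) <= RI_LAST


def ri_to_letter(ch: str) -> str:
    """Map a Regional Indicator to its ASCII letter (A-Z)."""
    return chr(ord(ch) - RI_FIRST + ord("A"))


def flag(code: str) -> str:
    """Build a Regional Indicator string from ASCII letters, e.g. 'US'."""
    return "".join(chr(RI_FIRST + ord(c) - ord("A")) for c in code)


def regionals_ok(label: str) -> bool:
    """True if every run of Regional Indicators splits into allowed pairs."""
    i = 0
    while i < len(label):
        if not is_regional(label[i]):
            i += 1
            continue
        j = i
        while j < len(label) and is_regional(label[j]):
            j += 1
        run = label[i:j]
        if len(run) % 2:  # odd-length runs leave a singleton behind
            return False
        for k in range(0, len(run), 2):
            if ri_to_letter(run[k]) + ri_to_letter(run[k + 1]) not in VALID_FLAGS:
                return False
        i = j
    return True
```

With this sketch, `regionals_ok(flag("US") + flag("JP"))` passes, while a lone indicator such as `flag("U")` or an unlisted pair such as `flag("ZZ")` is rejected.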
To reduce the character space, we could force rare scripts (Unicode Excluded) to be pure labels (single script). For example, we could restrict hieroglyphics to “pure” labels. Potentially, pure labels can also include emoji?
- Latin/Greek/Cyrillic – can’t mix with each other + whole-script confusables
- Excluded scripts – must be pure?
- Everything else can mix (for now)
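As a rough illustration of the Latin/Greek/Cyrillic rule, the sketch below flags labels that combine more than one of those scripts. Python's standard library has no Script property, so this guesses the script from the first word of the character name, which is only an approximation. `RESCRIPTED_TO_COMMON` is a hypothetical override set that models the Xi proposal; none of these names come from the actual libraries.

```python
import unicodedata

# The three scripts that cannot mix with each other.
UNMIXABLE = ("LATIN", "GREEK", "CYRILLIC")

# Hypothetical override (assumption): characters treated as Common script,
# here only GREEK SMALL LETTER XI as proposed for the Ether symbol.
RESCRIPTED_TO_COMMON = {"\u03be"}


def rough_script(ch: str) -> str:
    """Approximate the script from the character name's first word."""
    if ch in RESCRIPTED_TO_COMMON:
        return "OTHER"
    first_word = unicodedata.name(ch, "").split(" ", 1)[0]
    return first_word if first_word in UNMIXABLE else "OTHER"


def mixes_unmixable_scripts(label: str) -> bool:
    """True if the label uses more than one of Latin/Greek/Cyrillic."""
    found = {rough_script(ch) for ch in label} - {"OTHER"}
    return len(found) > 1
```

For example, `mixes_unmixable_scripts("r\u0430ffy")` (with a Cyrillic а in place of the Latin a) returns `True`, while `"raffy"` and `"a\u03be"` return `False`. A production check would need real script data and the whole-script confusable tables, which this does not attempt.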
I also think these whole-script confusable rules are too strict:
- Greek / Cyrillic
- The following is all because of:
  411 (Б) CYRILLIC CAPITAL LETTER BE == b̄ [62 (b) LATIN SMALL LETTER B,304 (x̄) COMBINING MACRON]
  
One more update on these 2 disabled emoji: I don’t expect the text-presentation default rendering for and to change (they’ll be uncolored !! and !? forever) so I think the best UX is to map → and → .
What about characters that aren’t supported natively on Mac/Win/Nix? Should that be part of the consideration? For example, 26E4 (⛤) PENTAGRAM doesn’t work on Mac.

serenae · September 26, 2022, 2:23pm

Is that because you think it should be treated as more of a currency symbol? I guess that’s alright
Allowing only valid regional indicator combinations makes sense; it’s similar to how only valid RGI emoji combinations are allowed
Yes that sounds good for that table of excluded scripts. But I think emojis should still be allowed to mix with other characters
I think the whole-script confusable stuff might be too much yeah. I think just mixed-script, so that Latin/Greek/Cyrillic aren’t mixed together.
Makes sense
I don’t think characters should be excluded just because a particular platform doesn’t display them, unless there are other good reasons

raffy · September 27, 2022, 8:07am

Invalid Flag Sequences / Single Regionals Disallowed
All excluded scripts cannot mix:


except with emoji:
Ether symbol is now common script:
Pentagram and other characters that don’t work on Mac are restored:

Player1_DYORGames · September 27, 2022, 4:20pm

Excited to see the new thread started and everything consolidated!

As we continue to move forward with proposals regarding normalization, is my understanding correct that, even if this first round of updates gets pushed through, the ENS team only wants to roll out the normalization standards once and not have to revisit the topic for another update?

I ask because I’m hoping that it’s not too late to sort out some other characters and use cases before that window closes (within reason of course).

Two characters specifically that I think should be allowed are & and + in the middle of words. Official brands (which can be very sensitive about the accurate notation of their names) will be looking for this, such as johnson&johnson, black+decker, A&M Records, arm&hammer, AT&T, and Kraft Sports + Entertainment.

Outside of brand names specifically, these symbols are commonly used as joiners (& more so than +) and could be applied in many two-word combinations.

Unless there are practical syntax issues with including these in the middle of words (if there are, please speak up), would it be possible to include these two characters in the normalization update?

nick.eth · September 28, 2022, 3:08am

I think revisiting is inevitable. Right now I’d like us to settle on a reasonable starting point, erring on the side of being too restrictive, since it’s easier to make the rules less restrictive over time than it is to add new restrictions.

I don’t think these can ever be part of a valid name, as they have special meaning in URLs (+ encodes a space in query strings, and & separates query parameters).
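To illustrate the concern, here is a small sketch using Python's standard `urllib.parse` module. The query strings are made-up examples; the point is only that a name containing + or & does not survive a round trip through a query string unless it is percent-encoded first.

```python
from urllib.parse import parse_qs, quote

# "+" is decoded as a space when a query string is parsed.
print(parse_qs("name=black+decker"))     # {'name': ['black decker']}

# "&" splits the query into separate parameters, truncating the value.
print(parse_qs("name=johnson&johnson"))  # {'name': ['johnson']}

# Percent-encoding avoids the ambiguity, but the name is no longer readable.
print(quote("at&t.eth", safe=""))        # at%26t.eth
```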

raffy · October 2, 2022, 11:14pm

Thoughts on disallowing skin tone and hair modifiers as singular emoji?

Stats: 6701 names with modifiers (98% valid), 18 pure modifier names, 109 names contain a singular modifier.

There is ambiguity around trailing skin-color with ZWJ sequences that contain modifiable characters.

There exist dedicated ZWJ sequences with skin-color styling:
The following ZWJ sequence is followed by a skin-color modifier, however it presents like a single emoji, with one golden human and one skin-colored human.

According to the spec, this should present like:

Disallowing singular modifiers would prevent this problem.

A separate decision would be to whitelist these mixed modifier ZWJ sequences (I say no.)
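The trailing-modifier case can be sketched as follows. This assumes a greedy longest-match parse against a list of known emoji sequences, which is my assumption for illustration and not necessarily how any real implementation tokenizes emoji. `KNOWN_SEQUENCES` is a hypothetical three-entry sample, and `leftover_modifiers` is an invented name.

```python
# Hypothetical sample of recognized sequences (assumption for this sketch).
KNOWN_SEQUENCES = [
    "\U0001F9D1\u200D\U0001F91D\u200D\U0001F9D1",  # people holding hands (ZWJ sequence)
    "\U0001F44D\U0001F3FD",                        # thumbs up + medium skin tone
    "\U0001F44D",                                  # thumbs up
]

# The five skin-tone modifiers, U+1F3FB through U+1F3FF.
SKIN_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}


def leftover_modifiers(label: str) -> list[int]:
    """Greedily consume known sequences; return indexes of modifiers left alone."""
    ordered = sorted(KNOWN_SEQUENCES, key=len, reverse=True)  # longest match first
    leftovers = []
    i = 0
    while i < len(label):
        match = next((s for s in ordered if label.startswith(s, i)), None)
        if match is not None:
            i += len(match)
            continue
        if label[i] in SKIN_MODIFIERS:
            leftovers.append(i)
        i += 1
    return leftovers
```

Under these assumptions, thumbs up followed by a modifier is consumed as one sequence, while the ZWJ sequence followed by a modifier leaves that modifier standing alone at index 5, so a rule that disallows singular modifiers would reject the label.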

serenae · October 3, 2022, 1:43pm

I would lean towards adhering to the spec and disallowing the singular modifiers.

raffy · October 5, 2022, 5:01am

The following was provided to me by zombiehacker who can’t post on the forums yet but wanted to comment about the modifier issue:

Today we are discussing RGI emojis, which stands for Recommended for General Interchange. What is and isn’t recommended for general use has changed greatly throughout the existence of humanity and the internet, so what is recommended today may not be in the future.

With that said, I fully understand this proposal from a programmer’s perspective. I have 28 years of programming experience, and I get the thought process.

The 109 pre-existing non-RGI emojis were coined “mistake emojis” by another holder of these non-RGI emojis. Mistakes and errors have been the most valuable items throughout human history.

The red cheeks Pikachu Pokemon card has sold for $10,000, why? Because it was a mistake and is supposed to have yellow cheeks, which sells for $1000.
https://news.tcgprice.com/10-most-valuable-pokemon-error-cards/
The 1922 penny without the D mark is worth $500. A penny is worth $500, why? A mistake, an error, or simply put, not conforming with the recommended general interchange of currency. 15 pennies out of 1 billion in 1945 were struck wrong when being made, making them sell for up to 1 million dollars. If we remove mistakes, we are removing what it is that humans love, the imperfections of life.

One of the rarest Dungeon and Dragon characters is the colossal red dragon worth $649. Just think, if we removed a character from D&D or pokemon and made all previous characters and or cards not even look the same retroactively, the players would not be happy. https://www.thegamer.com/dungeons-and-dragons-miniatures-worth/
What we’re dealing with here is bigger than an RGI or non-RGI issue, we are dealing with finances, the human mind, collectibles, and rarity. Only 109 of these mistakes currently exist. Does this mean every mistake is worth more than the original? No, it doesn’t, humans are unique in what they put value to.

How do we stop this going forward?

Those that are already minted, these 109, should be put into an allow-list of supported non-RGIs because they have already been minted and already exist. I don’t want to dive into the topic of censorship, but breaking something for being different would demonstrate that you have the ability to censor words or phrases that a dictator, ruler, or province doesn’t like.

Going forward, anything that does not comply with the RGI sequence should be blocked before it is minted, and names from anyone who tries to cheat and bypass the system should also be voided. What exists should always exist.

raffy · October 5, 2022, 6:08am

I have disabled singular skin modifiers:
I left singular hair modifiers () enabled as they function slightly differently and don’t seem to collapse/merge into existing sequences (as it requires an additional ZWJ).
For reference, all disabled emoji can be found here along with their fate (mapped, invalid).
I’ve added many more characters to the disallowed list although still more work is needed. Most of these changes are for extremely rare/never used characters that are legal in IDNA.
I marked many more symbols as isolates to prevent abuse.
I want to add the limited-use scripts to the set of scripts that can’t mix.
I think we should relax some more Greek characters to Common script:
- Latin µ is mapped to Greek μ by IDNA
- π (Greek-only) is very common
- Maybe: σ (Greek-only)
- For example, ω shouldn’t be mapped because there are Latin/Greek/Cyrillic variants.
If we wanted tilde (7E (~) TILDE), I think it’s possible to use 223C (∼) TILDE OPERATOR and we could provide a convenience mapping like apostrophe, otherwise all the tilde-likes should be disallowed.
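A minimal sketch of what such a convenience mapping could look like, assuming a plain per-character table. The set of "tilde-likes" below is my own hypothetical selection of four characters for illustration; it is not a list taken from the proposal.

```python
TILDE_OPERATOR = "\u223C"  # 223C TILDE OPERATOR

# Convenience mapping: ASCII tilde is rewritten to the tilde operator.
CONVENIENCE_MAP = {"~": TILDE_OPERATOR}

# Hypothetical sample of tilde-likes to reject (assumption for this sketch):
# SMALL TILDE, SWUNG DASH, WAVE DASH, FULLWIDTH TILDE.
DISALLOWED_TILDE_LIKES = {"\u02DC", "\u2053", "\u301C", "\uFF5E"}


def normalize_tildes(label: str) -> str:
    """Map '~' to U+223C and raise on any other listed tilde-like character."""
    out = []
    for ch in label:
        if ch in DISALLOWED_TILDE_LIKES:
            raise ValueError(f"disallowed tilde-like character U+{ord(ch):04X}")
        out.append(CONVENIENCE_MAP.get(ch, ch))
    return "".join(out)
```

Here `normalize_tildes("a~b")` returns `"a\u223Cb"`, and a label that already uses U+223C is returned unchanged.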

serenae · October 5, 2022, 1:51pm

I think non-RGI emoji sequences should continue to be disallowed. Some of those could be RGI in the future, and can be added later, like when we do an update for a new Unicode version or something. The arguments about misprinted coins or cards don’t make sense at all to me, that is a completely different circumstance. Any name that was minted will continue to exist and the owner will still own it, that cannot change. The only thing being tweaked here is whether the name has valid metadata or not. We already know that there are going to be plenty of names that were valid on the website in the past which are now invalid, and I believe TNL is working on some plan to issue refunds for those.

===

Adding limited-use scripts to unmixable makes sense
π is fine at least, not sure about others
That sounds fine to me about the tilde operator, so you’d also make all the other tilde-likes map to 223C as well?

Theth.eth · October 6, 2022, 12:08pm

I use ~ a lot in my work, but I feel it’s a confusable for - for a lot of people when they see it visually

Just my opinion

I’m still also not sure about the bullets, I know one is a circle and one is square, but they kinda can be confused as an Arabic 0 in my view, you also have the middle dots being valid just now

Also, is the apostrophe ’ (U+2019) not a confusable for ׳ (U+05F3)?

father.eth · October 10, 2022, 5:30pm

~ and • are not confusables.

All of the apostrophe confusables should be mapped to ’ (U+2019).

Theth.eth · October 11, 2022, 1:02am

If you map all apostrophe confusables to U+2019, then you are excluding Hebrew words

In my view, this is not right

Braille isn’t going through just now, but in the future will it??

Braille is made up of round dots in various combinations, so yes, bullets are a confusable for Braille

I’m not bothered if Braille never comes to ENS, I’m still trying to fully work out how it would work, but the decision has to be made now, if bullets are allowed then the door on braille in the future has already been closed in my view

father.eth · October 11, 2022, 6:28pm

Your opinion on confusables is skewed.

Ever since the underscore debate your opinions have been subpar.

raffy · October 11, 2022, 8:32pm

No. Geresh is Hebrew. However, I’m not sure how to handle the RTL languages. I haven’t been able to get much feedback regarding CheckBidi or if a label could be mixed direction (mixed direction names are really weird to manipulate.) Hebrew could easily be script restricted (which makes direction irrelevant) – I’m just not sure if that’s acceptable.
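For anyone who wants to experiment with mixed-direction labels, here is a small sketch using `unicodedata.bidirectional` from Python's standard library. It only looks at strong directional classes and is not an implementation of CheckBidi; the function names are invented for illustration.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # strong right-to-left (Hebrew, Arabic letters)
LTR_CLASSES = {"L"}        # strong left-to-right


def directions(label: str) -> set[str]:
    """Collect the strong directions present in a label."""
    found = set()
    for ch in label:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            found.add("rtl")
        elif bidi_class in LTR_CLASSES:
            found.add("ltr")
    return found


def is_mixed_direction(label: str) -> bool:
    return len(directions(label)) == 2
```

European digits have the weak class `EN`, so a Hebrew label followed by `123` is not reported as mixed by this sketch, whereas a Hebrew letter next to a Latin letter is.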

I think we should revisit the non-RGI whitelisted emoji. I now think it’s a mistake to whitelist any non-RGI emoji.

Reference: UTS-51: Emoji Sets

Note: this suggests skin color modifiers should be allowed as singular, but doesn’t account for rendering/platform differences which cause the modifier to get absorbed into other characters.
Note: this also suggests that Emoji which aren’t Basic_Emoji aren’t RGI. This only includes digits (+FE0F) and single regionals, which are both disallowed.

Character Viewer (for desktop)

serenae · October 11, 2022, 9:13pm

I didn’t realize any non-RGI were whitelisted, but yeah I agree that only RGI emojis should be allowed. If any become RGI in the future, then they can be added in future updates, such as when we update for a new Unicode version.

Ronald · October 13, 2022, 5:46am

Hebrew and Arabic both read numbers left to right, so there are mixed names possible always. It breaks brains, and many programs. Check out the El Al logo. It reads the same RTL and LTR. Cool use case to ponder.

Where are we now with hyphen rules? It’s hard to follow with all the versions. Is it still only leading ones? They are getting more attention these days in the speculation world.

This goes really deep into a lot of things. I concur with zombiehacker. If there is a way to roll out in stages of languages after consulting with multiple academics in each one it would be good. There’s a lot of crazy stuff with various languages, and the ENS should support them all. Language unification is one of the worst things that could happen to humanity, and the DNS took us quite a ways to that point. Decentralization is really based on every language and culture having a proportionally equal voice.

raffy · October 13, 2022, 6:15pm

Hyphens are allowed anywhere except **-- (hyphens at positions 2 and 3, counting from 0, if the label is ASCII). Reasonable hyphen-like characters are mapped to hyphen. Remaining dash-like characters are disallowed.
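Read literally, the positional rule could be sketched like this. It is a one-line illustration with an invented function name, and it covers only the **-- restriction, not the mapping of hyphen-like characters.

```python
def hyphen_rule_ok(label: str) -> bool:
    """Reject ASCII labels whose characters at 0-based positions 2 and 3 are both hyphens."""
    return not (label.isascii() and label[2:4] == "--")
```

Under this reading, `hyphen_rule_ok("xn--abc")` and `hyphen_rule_ok("ab--cd")` are `False`, while labels such as `"-0-"`, `"-00-"`, and `"a--b"` pass because the double hyphen does not sit at positions 2 and 3.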

I have focused on Latin/Common scripts, punctuation, emoji, and symbol-like characters. Excluded/limited-use scripts have been restricted. The ASCII-confusable scripts (Greek/Cyrillic) can’t mix or use whole-script confusables. Combining marks are restricted to one. Many marks are disabled.
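The "restricted to one" rule for combining marks could be approximated as below. This sketch assumes the limit applies to consecutive nonspacing or enclosing marks (general categories `Mn` and `Me`) that remain after NFC normalization; that interpretation, and the function names, are assumptions made for illustration.

```python
import unicodedata


def max_mark_run(label: str) -> int:
    """Longest run of consecutive Mn/Me marks after NFC normalization."""
    longest = run = 0
    for ch in unicodedata.normalize("NFC", label):
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


def marks_ok(label: str, limit: int = 1) -> bool:
    return max_mark_run(label) <= limit
```

For example, `"b\u0304"` (b with a combining macron) has a run of one and passes, `"e\u0301"` composes to a single precomposed character under NFC and has no marks left, and `"b\u0304\u0301"` has a run of two and fails.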

Ronald · October 17, 2022, 5:12am

So to clarify, would -0-.eth be invalid, but -00-.eth would be valid?

I forgot about the leading underscores, is that still just leading?

After thinking about this for some weeks, I really would think everything through with the mappings/groupings. I can think of weird things, like in Dutch ij is one letter, and their keyboard has it as one letter, but everyone Dutch doesn’t want a Dutch keyboard because everyone speaks English. I remember learning Dutch for a year after I lost my phone in Amsterdam in a taxi. I replaced it with the same one, except the Windows CE version it was in was locked to Dutch language. It was a lot of money, so I just learned the Dutch UI instead of buying another phone. My Dutch friends laughed a lot. Microsoft didn’t get that localizing for Dutch meant putting things in English.

The context of that long story matters because even in that localization there wasn’t ij as a letter in the primary keyboard, it was an alternate still, so I always just typed the two letters. This kind of scenario where you can spell things both ways can be infuriating when mapping languages to specific keysets, or making word lists.

I write in Spanish often (poorly), but I never use accents or the ñ because I have an English keyboard. In fact, the English-speaking city where I grew up had a street name in Spanish with an ñ, and 30 years on, now it’s replaced with just an n, and everyone says the street name wrong. Kind of like how everyone says Crimea wrong today in English (Hint: it never rhymed with crime before the 2000s). A simple gap in history, and new history was made. I hope the same thing doesn’t happen to more obscure languages just because someone not familiar with the language makes a booboo. I’m not going to take any political stance here, but I do think perception has shifted based on the pronouncing of Crimea, and writing things as Karabakh instead of Artsakh. Almost nobody means harm doing it, but yet ignorance got in there and massaged history for the eyes of outsiders.

Æ and Œ are both valid letters in English. Just because English changed to omit these letters most of the time does not make them less valid. It’s similar to church Russian or old Russian. Ij in Dutch has kind of gone the way of Æ and Œ, but in our lifetime. It’s why it was already relegated on the keyboard circa 2006. Should Œ map to oe? Are they the same?

I know how to pronounce them, but I have no idea how to even argue for what to do if there were a mapping of just English letters. Who decides the English alphabet? Is it all of the English words that ever existed? Is it just what is currently taught in schools? What about dialects of English that have weird digraphs?