ENS Name Normalization

serenae · May 6, 2022, 1:04am

It kinda already is a vice-versa situation, since the floodgates are open right now on the ENS manager app. People are registering all kinds of weird things haha.

Even some not-so-weird things, like the degree symbol ° that the ENS manager app allows, will apparently be disallowed (and also invalid per the metadata service) in the future, if the current build is anything to go by:

nick.eth · May 6, 2022, 1:14am

Yes, unfortunately true. The team is working on synchronising the app and the metadata service using our intermediate rules, so that people can’t register names that the metadata service treats as invalid, while we wait for the new normalisation function to be ready.

serenae · May 6, 2022, 1:15am

That would be awesome as a stopgap solution, thank you.

raffy · May 6, 2022, 1:33am

On output, no, because they’re all capitals (and Lira ₤ is not in this set.)
On input, ¥ is in the Y- confusable, and € is in C- confusable.

I agree there’s many symbols that could go either way. IDNA 2003 blocks almost nothing. IDNA 2008 blocks more. Many unique symbols are still disabled and many exotic/adorned/confusable characters are still enabled (and treated as unique.) The confusables are very lacking and poorly implemented.

A perfect example is: ꭑ (AB51) Latin Small Letter Turned Ui

Does that map to ASCII m? No.
Is it confusable with ASCII m? No.
What does it confuse with? ṃ (1E43) Latin Small Letter M with Dot Below.

That’s just insane to me.

For ° (B0) Degree Sign:

Confusables across all scripts:

[image: confusables for ° across all scripts]

Confusables for just Latin (which get mapped to ASCII):
- º (BA) Masculine Ordinal Indicator → o
- ᵒ (1D52) Modifier Letter Small O → o
- ⁰ (2070) Superscript Zero → 0
Both IDNA 2003 and 2008 disallow degree sign.
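
To make the Latin mappings above concrete, here is a minimal Python sketch. The table holds only the three entries listed in this post, and `LATIN_CONFUSABLES`, `map_latin_confusables` and `describe` are made-up names for illustration, not part of any library or of the ENSIP code.

```python
import unicodedata

# Only the three Latin confusables listed above; a real table would be far larger.
LATIN_CONFUSABLES = {
    "\u00ba": "o",  # º MASCULINE ORDINAL INDICATOR
    "\u1d52": "o",  # ᵒ MODIFIER LETTER SMALL O
    "\u2070": "0",  # ⁰ SUPERSCRIPT ZERO
}


def map_latin_confusables(label: str) -> str:
    """Replace each listed confusable with its ASCII target."""
    return label.translate(str.maketrans(LATIN_CONFUSABLES))


def describe(ch: str) -> str:
    """Return 'HEX NAME' for one character, handy for inspecting a glyph."""
    return f"{ord(ch):X} {unicodedata.name(ch, '<unnamed>')}"


print(describe("\u00b0"))  # B0 DEGREE SIGN
```

Note that ° itself is not in the table: per the above, it is disallowed rather than mapped.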

serenae · May 6, 2022, 1:56am

Makes sense for the degree sign, thanks!

zadok7 · May 6, 2022, 2:02pm

Would updating the warning message make this a bit more understandable for those registering non-ASCII names, @nick.eth?

Scrolling through this thread, name normalization looks super complex. While it’s being worked on, an updated warning message may be helpful.

Wildfire · May 6, 2022, 2:31pm

Starting cautious and getting more permissive over time sounds like the wiser approach.
Although there is a bonanza of symbols that might not end up being allowed, it seems, based on raffy’s reports, that luckily it won’t affect all that many ENS names.

Raffy’s ENS work has been impeccable with the ENS-IP, and I believe we should adhere to his suggestions as much as possible (if not entirely). Nick has also been great overseeing the development of the new ENS-IP.

I am sure the great minds behind ENS understand the situation at hand better than anyone and want the new ENS-IP to be up and running as soon as possible.

P.S.: Serenae, I don’t know how much you are getting paid, but they should probably double it. Outstanding presence everywhere.

raffy · May 9, 2022, 8:11am

I see someone registered a 38894 character name (concatenation of integers 1-10000.) Also someone registered an enormous string of {110B} characters, which seems to break text/line/word/overflow-wrapping logic in all browsers (and my error report).

I noticed someone registered Arabic numbered domains (eg. 123 → ١٢٣.eth). I’m not sure I agree with UTS-46 and CheckBidi here, and I think they’re wrong:

UTS: A Bidi domain name is a domain name containing at least one character with Bidi_Class R, AL, or AN. See [IDNA2008] RFC 5893, Section 1.4.
Bidi: The first character must be a character with Bidi property L, R, or AL. If it has the R or AL property, it is an RTL label; if it has the L property, it is an LTR label.

A missing exception is: if the label is 100% AN (Arabic Number), it should be valid, as the direction is irrelevant (edit: maybe I’m wrong for 123 vs 321 vs 123RTL?)
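
A rough Python sketch of the two quoted rules and the proposed exception, using the Bidi classes from the standard `unicodedata` module. The function names are made up for illustration, and `is_all_arabic_number` is the hypothetical exception from this post, not something either spec defines.

```python
import unicodedata


def bidi_classes(label: str) -> list[str]:
    """Bidi_Class of every character in the label."""
    return [unicodedata.bidirectional(ch) for ch in label]


def is_bidi_label(label: str) -> bool:
    """Per the UTS-46 wording quoted above: at least one R, AL, or AN character.
    (The spec phrases this for a whole domain name; this checks one label.)"""
    return any(c in ("R", "AL", "AN") for c in bidi_classes(label))


def passes_first_char_rule(label: str) -> bool:
    """Per the RFC 5893 rule quoted above: first character must be L, R, or AL."""
    return bool(label) and unicodedata.bidirectional(label[0]) in ("L", "R", "AL")


def is_all_arabic_number(label: str) -> bool:
    """The exception proposed above (hypothetical): every character is AN."""
    return bool(label) and all(c == "AN" for c in bidi_classes(label))


arabic_123 = "\u0661\u0662\u0663"  # ١٢٣
print(bidi_classes(arabic_123))    # ['AN', 'AN', 'AN']
```

So ١٢٣ counts as a Bidi label, fails the first-character rule, and would only pass under the all-AN exception.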

I think the ContextJ change mentioned earlier is correct. With this change, ZWNJ will never appear in output and ZWJ only appear in emoji sequences. If you don’t care about silently erroring, this is equivalent to ignoring ZWNJ globally.

I keep going back and forth on a few decisions but I think I’ve chosen correctly after working through more cases. For example, I wasn’t completely sure if FE0E in ZWJ should be disallowed. It’s against the spec, but we already must allow normal emoji not in ZWJ sequences to use any emoji styling (because those names have already been registered and we don’t know what styling they intended.) The example of 🚴‍♂ 1F6B4 200D 2642 vs 🚴︎‍♂ 1F6B4 FE0E 200D 2642 vs 🚴︎♂ 1F6B4 FE0E 2642 I think makes it clear that FE0E in ZWJ is bad. ZWJ sequences should always be styled.
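
The three sequences above can be told apart with a few lines of Python. This is a simplified sketch, not the ENSIP logic: it only flags a text-style selector (FE0E) sitting directly next to a ZWJ, and the function names are made up.

```python
ZWJ = 0x200D
TEXT_STYLE = 0xFE0E  # text presentation selector


def hex_codepoints(s: str) -> str:
    """Space-separated hex code points, e.g. '1F6B4 200D 2642'."""
    return " ".join(f"{ord(ch):X}" for ch in s)


def has_text_style_next_to_zwj(s: str) -> bool:
    """True if FE0E and ZWJ are adjacent anywhere in the string.
    Simplification: FE0E elsewhere in a ZWJ sequence is not detected."""
    cps = [ord(ch) for ch in s]
    return any(TEXT_STYLE in pair and ZWJ in pair for pair in zip(cps, cps[1:]))


plain_zwj = "\U0001F6B4\u200D\u2642"         # 1F6B4 200D 2642
styled_zwj = "\U0001F6B4\uFE0E\u200D\u2642"  # 1F6B4 FE0E 200D 2642
no_zwj = "\U0001F6B4\uFE0E\u2642"            # 1F6B4 FE0E 2642

for s in (plain_zwj, styled_zwj, no_zwj):
    print(hex_codepoints(s), has_text_style_next_to_zwj(s))
```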

I created validation tests: they contain hand-written tests (from demo examples), randomly generated tests (that utilize mapped and complex emoji), and real-world tests (from registered names.)

I’m less bullish on enabling $ now that I see 💲 already exists. Similarly, I overlooked ₤ when considering £. To be conservative, maybe only ₿ should be enabled? (since it was used in many registered names)

I collected more ENS names (nearly 1M) although I still think I’m missing some. If someone has a complete registered name snapshot, please send it my way.

I added some DNS comments to my ENSIP. I also worked through a collision possibility and made a post about it. I added the correct Punycode encoding to the resolver demo and made a simple Punycode tool (since all online tools apply some kind of Unicode normalization.)

I think I’m done making code changes. I need a few more days to finish organizing the ENSIP document.

As I mentioned before, I’m not planning to include the single-script confusable logic in this release. After the discussion about input vs output confusables, the overall approach needs to be revised. To quickly recap: normalization must be idempotent. If normalize(normalize(a)) != normalize(a), all kinds of weird stuff happens. If you only catch confusables on the output, then you can’t catch confusing inputs, which is what end-users actually interact with, eg. Ⅷ vs VIII → viii. If you only catch confusables on input, then you can circumvent confusable detection using mixed-case input that leads to confusable output. So you have to check both places. I also dislike having to make opinionated decisions about the fates of peoples’ names.
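
A tiny Python sketch of the idempotence check and the Ⅷ vs VIII example. `toy_normalize` is a stand-in (NFKC followed by lowercasing), chosen only because it reproduces the example; it is not the ENSIP normalization function.

```python
import unicodedata


def toy_normalize(name: str) -> str:
    """Stand-in normalizer: NFKC, then lowercase. NOT the ENSIP algorithm."""
    return unicodedata.normalize("NFKC", name).lower()


def is_idempotent_on(name: str) -> bool:
    """Check normalize(normalize(a)) == normalize(a) for one input."""
    once = toy_normalize(name)
    return toy_normalize(once) == once


print(toy_normalize("\u2167"))  # Ⅷ (single character) -> viii
print(toy_normalize("VIII"))    # four ASCII letters   -> viii
```

Both inputs collapse to the same output, so a check that only looks at the output cannot tell which one the user actually typed.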

nick.eth · May 9, 2022, 8:20am

I still think this is fine, as long as we advise UIs to replace the input with the normalised output.

snesne · May 9, 2022, 2:23pm

Does that mean we can take the current implementation of the library and live demo as the status quo and build on that in the meantime, until the update is ratified by the DAO and implemented in the ENS app? I ask because there are some efforts in Discord to program a unified filter and display application, especially for emojis (and other subsets of ENS domain collections), but so far it has not been possible to publish this while the code changes are not final, so as not to confuse people who register and possibly lose their domain through the application. Also, since we have been waiting for something like this update for half a year now, and with the current ENS hype, we are very eager to get this out.

The live demo especially has become the go-to tool in the community for checking and verifying what is “okay to register” without the domain possibly being invalidated in the future, so I wonder if it is fine to use the current state until the changes go through the DAO voting process and get implemented in the main client.

Wildfire · May 9, 2022, 2:43pm

I don’t think that’s the case. We are going to have to wait until the final report.

Just to bring up the most recent topic: currency symbols (which raffy mentions above and seems undecided about) are resolving fine in the live demo tool. Personally, I think they would be tremendous to have.

What I do wonder is what would happen to those who registered names through ENS that later become “invalid”. Take the currency example, for instance: are users getting refunded?

I am sure there are more cases with other symbols, but this is something I don’t see being addressed, and a refund sounds like the fairest thing to do.

snesne · May 9, 2022, 6:12pm

I just had an idea for what you could build with the currencies, which would actually be quite a functional improvement over the current system (1 wallet for everything) by utilizing subdomains.

domain.eth -> ₿.domain.eth, $.domain.eth, ¥.domain.eth, Ξ.domain.eth, €.domain.eth etc.

With the subdomains resolving to a contract or wallet that handles only the specific currency/ERC20.

You could have different wallets for different ERC20 stablecoins and other synthetic currencies that are tied to your main eth wallet. This would make it much easier to handle these different ERC20 tokens, much like a checking account does in real life. It also takes the fear out of sending big funds to a random hex (double/triple checking intensifies), when a subdomain clearly defines where and what is supposed to be sent.

Also imagine if exchanges used subdomains like these instead of giving out random hex addresses to the user to deposit funds. It would eliminate many issues with people sending the wrong ERC20 tokens to exchange addresses, since they could give out clearly defined subdomains that communicate which assets are to be sent to what address.

The thought of this becoming a thing actually makes me want to build something like that. Subdomains seem to be something that is not fully explored by the community yet, but they have so much potential in the future.

Wildfire · May 9, 2022, 7:39pm

Wow. This sounds beautiful and very functional. The big €$¥£ along with ₿ and Ξ would be tremendous to have and be able to apply in something like this.

It could actually be a very hygienic practice if that trend caught on. Very accessible and renowned symbols. Hopefully raffy and everyone can find a way to fit those in; that was the original intention, it seems.

raffy · May 10, 2022, 9:05pm

Ξ is already possible, it just isn’t normalized.

Additionally, Ricmoo made EIP-634 regarding display names and similar ideas were mentioned earlier in this thread.

aрe.eth with Cyrillic p is a perfect example of a malicious confusable. It is valid at the moment. It would be invalid with single-script output confusables logic.
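
A crude way to see the mixed scripts in a name like that, as a Python sketch. Python’s standard library does not expose the Unicode Script property, so this approximates it with the first word of each character’s Unicode name. That is a heuristic I’m assuming for illustration only (it misfires on some characters), and `rough_scripts` is a made-up name.

```python
import unicodedata


def rough_scripts(label: str) -> set[str]:
    """Approximate the scripts in a label by the first word of each
    letter's Unicode name. Heuristic only, not the real Script property."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


spoof = "a\u0440e"  # Latin a, Cyrillic er (U+0440), Latin e
print(sorted(rough_scripts(spoof)))  # ['CYRILLIC', 'LATIN']
print(sorted(rough_scripts("ape")))  # ['LATIN']
```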

💲₿💲💲₿💲.eth is valid if ₿ is allowed. abcd°.eth is not. I agree ° would be a nice character to have (along with many others.) This probably needs to wait for a more thorough review of all characters by the community. (I could see it either being its own character, or mapped to o.)

ユクシー.eth and マーシャドー.eth are both valid.

You can check the potential fate of these names here. We should probably have a separate thread for these issues.

Okay, I’ll post the corresponding error report tonight for single-script output confusables and then we can make a decision.

nick.eth · May 11, 2022, 3:57am

Please don’t post individual support requests to this thread.

raffy · May 11, 2022, 10:07am

Any thoughts on 2-in-1 characters (ꜳæꜵꜷꜹꜻꜽʤʣʥᴔꭁꭂʩǁʪɮʫʨꝷʦʧꜩɱᵯ) being confusable? eg. aa vs ꜳ. aa shouldn’t be confusable because it’s double ASCII.

Combining Marks (CM) modify how a character is presented:

å = 61 30A (where 30A is a CM)
å = E5

NFC is responsible for collapsing these together, eg. they both normalize to E5. For some characters, there is no combined glyph, eg. e̊ = 65 30A has no corresponding single character form.

Multiple CM can be attached to the same character, eg. ã̰ = 61 303 330. NFC is responsible for putting the CM in a canonical order.
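
The NFC behaviour described above can be checked with Python’s standard `unicodedata` module. A minimal sketch (`nfc` and `hex_codepoints` are just helper names for this example):

```python
import unicodedata


def nfc(s: str) -> str:
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", s)


def hex_codepoints(s: str) -> str:
    """Space-separated hex code points of a string."""
    return " ".join(f"{ord(ch):X}" for ch in s)


print(hex_codepoints(nfc("a\u030a")))  # E5      (collapsed to the single character)
print(hex_codepoints(nfc("e\u030a")))  # 65 30A  (no combined form exists)

# Two marks in either order end up in the same canonical order.
print(nfc("a\u0303\u0330") == nfc("a\u0330\u0303"))  # True
```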

You can stack CM on characters, eg. ã̃̃̃̃̃̃̃̃̃ and a̰̰̰̰̰̰̰̰̰̰.

Some CM stack without any visual indication, eg. a̸ (1x) vs a̸̸̸̸̸̸̸̸̸̸ (10x).
a̸̸.eth ≠ a̸̸̸.eth ≠ a̸̸̸̸.eth = ...

There’s currently 500 registered names with CM. We could disallow some of the malicious ones (underscore-like or very small, etc.)? We could disallow stacking?
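
If we did want to flag or limit stacking, one possible check is to measure the longest run of consecutive combining marks after NFC. This is only a sketch of the idea: `max_mark_run` is a made-up name and any threshold would be a policy choice. Note that it treats every character in general category M* as a mark, which also includes variation selectors such as FE0F.

```python
import unicodedata


def max_mark_run(label: str) -> int:
    """Longest run of consecutive combining marks (general category M*)
    in the NFC form of the label."""
    longest = run = 0
    for ch in unicodedata.normalize("NFC", label):
        if unicodedata.category(ch).startswith("M"):
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest


print(max_mark_run("a" + "\u0338" * 2))   # 2
print(max_mark_run("a" + "\u0338" * 10))  # 10
```

A UI could use the same count to show a “this name contains combining marks” warning instead of rejecting the name.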

nick.eth · May 11, 2022, 11:06pm

They don’t seem confusable to me; they look visually distinct. What does the confusable mapping you’re using say?

Do we need to disallow either? I’d rather only disallow things that have a high probability of deceiving people.

raffy · May 13, 2022, 9:19am

They’re all confusable with their separate character equivalents. However, I agree they’re probably fine. They’re easy to distinguish when monospaced.

The av confusable is the only one with two versions: ꜹ and ꜻ. So those probably need to confuse, unless we want to pick one as the preferred, or map one to the other.

Nope. They’re all currently valid. I’m just unsure how someone can tell a̸̸.eth and a̸̸̸.eth apart. Almost need a warning for “this name contains combining marks”.

The recommendations here seem reasonable.

raffy · May 13, 2022, 9:36am

That took much longer than expected – I went down a rabbit hole with UTS-39 and UAX-24.

I implemented the Highly Restrictive version. It’s less strict than the explicit single script version I had implemented before, so things like 1a〆.eth (ASCII+Han) work but aрe.eth (Cyrillic p) does not.
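
For anyone unfamiliar with the UTS-39 Highly Restrictive level, here is a simplified Python sketch of the rule. It takes the set of scripts as input (Python’s standard library has no Script property) and skips the Script_Extensions handling that the real algorithm needs, so treat it as an illustration and not as the code running in the demo. The names are made up.

```python
# Script combinations UTS-39 "Highly Restrictive" allows beyond a single script.
ALLOWED_COMBINATIONS = [
    {"Latin", "Han", "Hiragana", "Katakana"},
    {"Latin", "Han", "Bopomofo"},
    {"Latin", "Han", "Hangul"},
]

# Scripts that do not count towards mixing.
IGNORED = {"Common", "Inherited"}


def is_highly_restrictive(scripts: set[str]) -> bool:
    """True if the scripts are a single script or fit inside one allowed combination."""
    real = scripts - IGNORED
    if len(real) <= 1:
        return True
    return any(real <= combo for combo in ALLOWED_COMBINATIONS)


print(is_highly_restrictive({"Latin", "Han"}))       # True
print(is_highly_restrictive({"Latin", "Cyrillic"}))  # False
```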

Latest Error Report (737 single script errors, .json). I sorted the errors by subtype. The vast majority look malicious to me. It is very easy to add script combinations to permit additional exclusions. The demo is also running this version.

I’ll provide another error report once the confusable part is working correctly.

Interestingly, the single-script logic makes some of the ContextO rules unnecessary: Greek Keraia, Hebrew Geresh, and Hebrew Gershayim are now all impossible to violate, because a violation requires additional scripts. The Arabic-Indic rule will be removed once confusables are active. This leaves Middle Dot (l·l) and Katakana.

zadok7 · May 13, 2022, 10:29am