ENS Name Normalization

⁄ [2044] is on my “I’m not sure” list. It’s used in the IDNA 2003 (and 2008) mappings for fractions (like ½), so if we disallow 2044, we also need to remove 20 mappings.

As of recently, there are about 250 names with 2044.
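
For anyone wanting to see why 2044 and the fraction mappings are tied together, here is a rough sketch. It assumes the built-in NFKC normalization is a fair stand-in for the IDNA mapping table (the two are not identical), and the helper names are made up for illustration.

```javascript
// Assumption: NFKC compatibility decomposition stands in for the IDNA mapping table.
const FRACTION_SLASH = 0x2044;

// Hypothetical helper: does this character expand to a sequence containing U+2044?
function mapsThroughFractionSlash(ch) {
  const mapped = ch.normalize("NFKC");
  return (
    mapped !== ch &&
    [...mapped].some((c) => c.codePointAt(0) === FRACTION_SLASH)
  );
}

// Format a string as space-separated hex codepoints, like the lists in this thread.
function toHex(s) {
  return [...s]
    .map((c) => c.codePointAt(0).toString(16).toUpperCase())
    .join(" ");
}

// "\u00BD" is the vulgar fraction one half.
console.log(toHex("\u00BD".normalize("NFKC"))); // 31 2044 32
console.log(mapsThroughFractionSlash("\u00BD")); // true
console.log(mapsThroughFractionSlash("a")); // false
```

Any character for which this returns true would lose its mapping if 2044 were disallowed.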

@raffy trivial:

import {ens_beautify, ens_normalize} from "@adraffy/ens_normalize";

will not resolve ens_beautify, but:

import {ens_beautify, ens_normalize} from "@adraffy/ens_normalize/src/lib";

will. Tested on Angular 2.0 v13.4

I incorporated these changes into the ENSIP and updated the corresponding libraries.

Damage report: ens_normalize (1.6.0) vs eth-ens-namehash (2.0.15) [1463921 labels] @ 2022-09-01T09:50:38.018Z

There are some Hebrew, Devanagari, and Thai names that are collateral damage from the CM change. It’s unclear if these are legitimate names or if they’re just decorated with superfluous CM.

w/r/t Unicode NF support: Node 14 with Unicode 13 now runs the validation tests successfully. Node 12 with Unicode 11 has 1 error (on a decomposable codepoint that was added in Unicode 13.)

To implement the Emoji + CM rule, the emoji.json now includes single-codepoint default text-presentation emoji (they used to be in chars.json and were considered characters.) Ultimately, I think this is an improvement, because they are emoji, but it potentially breaks the assumption that all emoji are safe – instead, now all emoji of 2+ codepoints are safe.

I also removed the following CMs which appear to be underscore-like:
cm-tally.json

320 (x̠) Combining Minus Sign Below
332 (x̲) Combining Low Line
333 (x̳) Combining Double Low Line
347 (x͇) Combining Equals Sign Below
FE2B (x︫) Combining Macron Left Half Below
FE2C (x︬) Combining Macron Right Half Below
FE2D (x︭) Combining Conjoining Macron Below
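
As a rough illustration of what removing these means in practice, the sketch below flags any of the seven codepoints listed above inside a label. The set is copied from the list; the function name is hypothetical and this is not how the library itself stores its tables.

```javascript
// Codepoints copied from the list above (underscore-like combining marks).
const UNDERSCORE_LIKE_CM = new Set([
  0x320, 0x332, 0x333, 0x347, 0xfe2b, 0xfe2c, 0xfe2d,
]);

// Hypothetical helper: return the offending codepoints found in a label, as hex.
function findUnderscoreLikeCM(label) {
  return [...label]
    .map((c) => c.codePointAt(0))
    .filter((cp) => UNDERSCORE_LIKE_CM.has(cp))
    .map((cp) => cp.toString(16).toUpperCase());
}

console.log(findUnderscoreLikeCM("x\u0332y")); // [ '332' ]
console.log(findUnderscoreLikeCM("xy")); // []
```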

Also, josiahadams brought up an issue about substring matching, which requires partial normalization. I added some comments to the ENSIP about fragments and included an example that uses ens_normalize_fragment(), which is nearly equivalent to the Processing step in the ENSIP. For example, this would let you prepare the fragments [303 39 FE0F 20E3] and "aa--" (which both fail full normalization.)

ENS Rego Bot isn’t picking up on this and is just showing 0.eth

I had to go through ENS Vision’s recent regos to see how it’s being done

Another one:

I wasn’t aware specifically of Horizontal Scan Line-# but I’m aware there are many symbols that require opinionated review.

I agree ⎺ ⎻ ⎼ ⎽ ⏤ ⎯ [23BA 23BB 23BC 23BD 23E4 23AF] need to be mapped to hyphen or disallowed. They should have been part of the hyphen mapping. Edit: I say we either (1) disallow all 6, (2) disallow the first 4 and map the last 2 to hyphen, or (3) map all to hyphen.

The policy should probably be: there’s one underscore, there are many things that map to hyphen, or it’s disallowed.

IMO, the following nearby codepoints could be disallowed too:

  • ⎀⎁⎂⎃ — Latin-like
  • ⌿⍀⍳⍸⍴⍷⍵⍹⍺⍶ — APL Letter Like
  • ⍡⍢⍣⍤⍥⍨⍩ — APL + Double Dot
  • ⍫⍬⍭ – APL + Tilde
  • ⍪⍮⍘⍙⍚⍛⍜ — APL + Hyphen/Underscore
  • ⎜⎟⎢⎥⎨⎪⎬⎮⎸⎹⍿ — Vertical Bar
  • aa⏜aa⏝aa⏞aa⏟aa⏠aa⏡aa – Rotated Brackets (ignore the a’s)

I mapped all 6 of those characters to hyphen (option 3.) While it’s best to disallow characters so they can be revisited later (mapping should be used sparingly), I see no future for alternative hyphens and mapping is much better than the alternative.
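
Option 3 amounts to something like the sketch below. The six codepoints are the ones listed earlier; the function name is hypothetical and the real library applies this through its generated mapping tables rather than a hand-written set.

```javascript
// The six hyphen-like codepoints discussed above, all mapped to 2D (-) HYPHEN-MINUS.
const HYPHEN_LIKE = new Set([0x23ba, 0x23bb, 0x23bc, 0x23bd, 0x23e4, 0x23af]);

// Hypothetical helper: replace every hyphen-like codepoint with "-".
function mapHyphenLike(label) {
  return [...label]
    .map((c) => (HYPHEN_LIKE.has(c.codePointAt(0)) ? "-" : c))
    .join("");
}

console.log(mapHyphenLike("a\u23AFb\u23E4c")); // a-b-c
```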

I left the others unchanged. Their usage is low at the moment: JSON

The latest version of ens_normalize.js includes NF 14.0.0, ending any Unicode issues. I was able to pass 100% of the validation tests (unmodified) on Node 11 (which is on Unicode 11.) I also removed the globalThis.atob() dependency (h/t makoto).

There’s now an ens_emoji() function which returns all of the emoji sequences.


I was incorrect here. There are some confusable emoji that eventually need to be reviewed to decide whether some kind of warning is appropriate:

  1. 🚴‍♂🚴🏻🚴 Gender:[1F6B4 200D 2642] vs Skin:[1F6B4 1F3FB] vs Singular:[1F6B4]
  2. 🇺🇸 🇺🇲 US:[1F1FA 1F1F8] vs UM:[1F1FA 1F1F2]

For the flag cases, the beautifier now inserts 200B between regional indicators, which prevents them collapsing into flags. This makes US vs UM obvious and works for all similar flag-related issues.
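
A minimal sketch of that idea, assuming the only goal is to stop adjacent regional indicators (1F1E6 to 1F1FF) from rendering as a flag. The function name is made up and the actual beautifier may differ in its details.

```javascript
const ZWSP = "\u200B"; // 200B ZERO WIDTH SPACE

function isRegionalIndicator(cp) {
  return cp >= 0x1f1e6 && cp <= 0x1f1ff;
}

// Hypothetical helper: insert 200B between any two adjacent regional indicators.
function separateRegionalIndicators(s) {
  const out = [];
  let prevWasRI = false;
  for (const ch of s) {
    const isRI = isRegionalIndicator(ch.codePointAt(0));
    if (isRI && prevWasRI) out.push(ZWSP);
    out.push(ch);
    prevWasRI = isRI;
  }
  return out.join("");
}

// US is [1F1FA 1F1F8]; after separation it becomes [1F1FA 200B 1F1F8].
const separated = separateRegionalIndicators("\u{1F1FA}\u{1F1F8}");
console.log([...separated].map((c) => c.codePointAt(0).toString(16).toUpperCase()).join(" "));
```

Since 200B is in the Ignored list above, the separated form should still normalize back to the same name.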

Delta Report: ens_normalize (1.6.1) vs ens_normalize (1.6.3) [1473529 labels] @ 2022-09-09T08:57:10.840Z

U+0027 is disallowed but U+2018 & U+2019 are allowed.

These are going to be confusable with each other, let alone with ' [27].

Thanks @Theth.eth. Agreed, a bunch of that punctuation block should be disallowed.


General Punctuation

Disallowed

  1. 2000 ( ) EN QUAD
  2. 2001 ( ) EM QUAD
  3. 2002 ( ) EN SPACE
  4. 2003 ( ) EM SPACE
  5. 2004 ( ) THREE-PER-EM SPACE
  6. 2005 ( ) FOUR-PER-EM SPACE
  7. 2006 ( ) SIX-PER-EM SPACE
  8. 2007 ( ) FIGURE SPACE
  9. 2008 ( ) PUNCTUATION SPACE
  10. 2009 ( ) THIN SPACE
  11. 200A ( ) HAIR SPACE
  12. 200C (‌) ZERO WIDTH NON-JOINER
  13. 200D (‍) ZERO WIDTH JOINER
  14. 200E (‎) LEFT-TO-RIGHT MARK
  15. 200F (‏) RIGHT-TO-LEFT MARK
  16. 2017 (‗) DOUBLE LOW LINE
  17. 2024 (․) ONE DOT LEADER
  18. 2025 (‥) TWO DOT LEADER
  19. 2026 (…) HORIZONTAL ELLIPSIS
  20. 2028 ( ) LINE SEPARATOR
  21. 2029 ( ) PARAGRAPH SEPARATOR
  22. 202A (<U+202A>) LEFT-TO-RIGHT EMBEDDING
  23. 202B (<U+202B>) RIGHT-TO-LEFT EMBEDDING
  24. 202C (<U+202C>) POP DIRECTIONAL FORMATTING
  25. 202D (<U+202D>) LEFT-TO-RIGHT OVERRIDE
  26. 202E (<U+202E>) RIGHT-TO-LEFT OVERRIDE
  27. 202F ( ) NARROW NO-BREAK SPACE
  28. 203C (‼) DOUBLE EXCLAMATION MARK
  29. 203E (‾) OVERLINE
  30. 2047 (⁇) DOUBLE QUESTION MARK
  31. 2048 (⁈) QUESTION EXCLAMATION MARK
  32. 2049 (⁉) EXCLAMATION QUESTION MARK
  33. 205F ( ) MEDIUM MATHEMATICAL SPACE
  34. 2061 (⁡) FUNCTION APPLICATION
  35. 2062 (⁢) INVISIBLE TIMES
  36. 2063 (⁣) INVISIBLE SEPARATOR
  37. 2065 (⁥) undefined
  38. 2066 (<U+2066>) LEFT-TO-RIGHT ISOLATE
  39. 2067 (<U+2067>) RIGHT-TO-LEFT ISOLATE
  40. 2068 (<U+2068>) FIRST STRONG ISOLATE
  41. 2069 (<U+2069>) POP DIRECTIONAL ISOLATE
  42. 206A (<U+206A>) INHIBIT SYMMETRIC SWAPPING
  43. 206B (<U+206B>) ACTIVATE SYMMETRIC SWAPPING
  44. 206C (<U+206C>) INHIBIT ARABIC FORM SHAPING
  45. 206D (<U+206D>) ACTIVATE ARABIC FORM SHAPING
  46. 206E (<U+206E>) NATIONAL DIGIT SHAPES
  47. 206F (<U+206F>) NOMINAL DIGIT SHAPES

Ignored

  1. 200B (​) ZERO WIDTH SPACE
  2. 2060 (⁠) WORD JOINER
  3. 2064 (⁤) INVISIBLE PLUS

Mapped

  1. 2010 (‐) HYPHEN → 2D (-) HYPHEN-MINUS
  2. 2011 (‑) NON-BREAKING HYPHEN → 2D (-) HYPHEN-MINUS
  3. 2012 (‒) FIGURE DASH → 2D (-) HYPHEN-MINUS
  4. 2013 (–) EN DASH → 2D (-) HYPHEN-MINUS
  5. 2014 (—) EM DASH → 2D (-) HYPHEN-MINUS
  6. 2015 (―) HORIZONTAL BAR → 2D (-) HYPHEN-MINUS
  7. 2033 (″) DOUBLE PRIME → [2032 2032]
  8. 2034 (‴) TRIPLE PRIME → [2032 2032 2032]
  9. 2036 (‶) REVERSED DOUBLE PRIME → [2035 2035]
  10. 2037 (‷) REVERSED TRIPLE PRIME → [2035 2035 2035]
  11. 2057 (⁗) QUADRUPLE PRIME → [2032 2032 2032 2032]

Valid

  1. 2016 (‖) DOUBLE VERTICAL LINE
  2. 2018 (‘) LEFT SINGLE QUOTATION MARK
  3. 2019 (’) RIGHT SINGLE QUOTATION MARK
  4. 201A (‚) SINGLE LOW-9 QUOTATION MARK
  5. 201B (‛) SINGLE HIGH-REVERSED-9 QUOTATION MARK
  6. 201C (“) LEFT DOUBLE QUOTATION MARK
  7. 201D (”) RIGHT DOUBLE QUOTATION MARK
  8. 201E („) DOUBLE LOW-9 QUOTATION MARK
  9. 201F (‟) DOUBLE HIGH-REVERSED-9 QUOTATION MARK
  10. 2020 (†) DAGGER
  11. 2021 (‡) DOUBLE DAGGER
  12. 2022 (•) BULLET
  13. 2023 (‣) TRIANGULAR BULLET
  14. 2027 (‧) HYPHENATION POINT
  15. 2030 (‰) PER MILLE SIGN
  16. 2031 (‱) PER TEN THOUSAND SIGN
  17. 2032 (′) PRIME
  18. 2035 (‵) REVERSED PRIME
  19. 2038 (‸) CARET
  20. 2039 (‹) SINGLE LEFT-POINTING ANGLE QUOTATION MARK
  21. 203A (›) SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
  22. 203B (※) REFERENCE MARK
  23. 203D (‽) INTERROBANG
  24. 203F (‿) UNDERTIE
  25. 2040 (⁀) CHARACTER TIE
  26. 2041 (⁁) CARET INSERTION POINT
  27. 2042 (⁂) ASTERISM
  28. 2043 (⁃) HYPHEN BULLET → 2D (-) HYPHEN-MINUS
  29. 2044 (⁄) FRACTION SLASH
  30. 2045 (⁅) LEFT SQUARE BRACKET WITH QUILL
  31. 2046 (⁆) RIGHT SQUARE BRACKET WITH QUILL
  32. 204A (⁊) TIRONIAN SIGN ET
  33. 204B (⁋) REVERSED PILCROW SIGN
  34. 204C (⁌) BLACK LEFTWARDS BULLET
  35. 204D (⁍) BLACK RIGHTWARDS BULLET
  36. 204E (⁎) LOW ASTERISK
  37. 204F (⁏) REVERSED SEMICOLON
  38. 2050 (⁐) CLOSE UP
  39. 2051 (⁑) TWO ASTERISKS ALIGNED VERTICALLY
  40. 2052 (⁒) COMMERCIAL MINUS SIGN
  41. 2053 (⁓) SWUNG DASH
  42. 2054 (⁔) INVERTED UNDERTIE
  43. 2055 (⁕) FLOWER PUNCTUATION MARK
  44. 2056 (⁖) THREE DOT PUNCTUATION
  45. 2058 (⁘) FOUR DOT PUNCTUATION
  46. 2059 (⁙) FIVE DOT PUNCTUATION
  47. 205A (⁚) TWO DOT PUNCTUATION
  48. 205B (⁛) FOUR DOT MARK
  49. 205C (⁜) DOTTED CROSS
  50. 205D (⁝) TRICOLON
  51. 205E (⁞) VERTICAL FOUR DOTS

  • For mapped, I think we should disallow what I bolded.
  • For valid, I’m not sure. ⁂⁜※ seem cool, I’m not sure about , disallow the rest?
  • Edit: for valid, I think we should disallow what I bolded (and map the hyphen-bullet.)

Frequencies: JSON

Edit: If we disallow ‿ [203F], we lose a lot of faces like ◕‿◕.

• [2022] would be a confusable for various braille letters

there is also this which could be a confusable for braille:

image

Would be good if there was a way to ban everything, but then select the ones you want, instead of the other way round

I tried this but you run into the same issues. Many symbols just need to be manually reviewed.

I updated my choices above and included a report. Most of the damage is names with apostrophe-like characters or leading bullets.

Did you mean for the fraction slash not to be in bold?

Would that not be a confusable with the forward slash on a keyboard, which isn’t allowed?

We discussed 2044 recently. I guess we need more community input.

Unicode 15 officially released. I don’t know if we want to include emojis now or wait until there’s platform support (I say now that they’re official.) I’ll also check what other differences show up when I switch to the latest Unicode data files.

Platform support takes years; Windows still doesn’t support Unicode 14. I say just include them now so we don’t have to release a new version of our libraries and then get ethers/metamask and everyone else to update again.

  1. 2019 (’) RIGHT SINGLE QUOTATION MARK should be valid.

This allows names and surnames to be used on the Ethereum Name Service (O’Brien, O’Donnell, Kevin O’Leary, etc.)

Okay, should we enforce:

  • Can’t touch another 2019? eg. ’’’.eth
  • Can’t start the label? End the label?
  • Can’t touch an emoji? ❤️’.eth
    (this would be similar to the combining mark rules)

Since ' [27] has always been disallowed, we could map it to 2019 for UX.

  • Can’t touch another 2019
  • Can’t start or end label
  • Can touch an emoji

Mapping ' [27] to 2019 is perfect.
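
Put together, the rules proposed here could look roughly like the sketch below. It only covers the apostrophe handling (map 27 to 2019, no leading or trailing 2019, no two 2019 in a row, emoji neighbors allowed); the function name is hypothetical and none of this is final.

```javascript
const APOSTROPHE = "\u2019"; // 2019 RIGHT SINGLE QUOTATION MARK

// Hypothetical helper: map ' [27] to 2019, then enforce the placement rules.
// Throws an Error describing the first rule that is violated.
function normalizeApostrophes(label) {
  const mapped = label.replace(/'/g, APOSTROPHE);
  if (mapped.startsWith(APOSTROPHE) || mapped.endsWith(APOSTROPHE)) {
    throw new Error("2019 cannot start or end a label");
  }
  if (mapped.includes(APOSTROPHE + APOSTROPHE)) {
    throw new Error("2019 cannot touch another 2019");
  }
  return mapped; // touching an emoji is allowed, so no check for that
}

console.log(normalizeApostrophes("o'brien") === "o\u2019brien"); // true
```

Under these rules ’’’.eth is rejected twice over (it starts with 2019 and has adjacent 2019), while o'brien.eth normalizes to o’brien.eth.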

Where can we see the 250 names with 2044?

Bunch of random updates:

  • During the cleanup of the code that derives the spec from the Unicode files and the rules we’re developing, I discovered that there are 3 Modifier_Base emoji that don’t have Modifier_Base + Modifier RGI equivalents. I had mistakenly assumed that every combination was RGI. I assume we still want to include them? (via the whitelist)
Emoji Whitelist Additions
// missing MOD_BASE + MODIFIER combinations
// (👪) FAMILY  
'1F46A 1F3FB',
'1F46A 1F3FC',
'1F46A 1F3FD',
'1F46A 1F3FE',
'1F46A 1F3FF',
// (👯) WOMAN WITH BUNNY EARS 
'1F46F 1F3FB',
'1F46F 1F3FC',
'1F46F 1F3FD',
'1F46F 1F3FE',
'1F46F 1F3FF',
// (🤼) WRESTLERS
'1F93C 1F3FB',
'1F93C 1F3FC',
'1F93C 1F3FD',
'1F93C 1F3FE',
'1F93C 1F3FF',
  • During some testing, I discovered that there are 166 characters that when decomposed (either valid or mapped) become 2+ adjacent combining marks. I disallowed all of them since our combining mark rule eliminates them anyway.
    They have very minimal use: JSON

  • For the punctuation discussion above:

    • I mapped 2027 (‧) HYPHENATION POINT to hyphen instead of disallowing it.
    • After looking at prior registrations, I kept 2022 (•) BULLET valid.
  • At the moment, only 2 emoji are disallowed. They are default text-presentation so they format unstyled like !! and !? (but with less kerning). For reference, both ? and ! by themselves are invalid. Nothing prevents them from being valid with how we currently handle emoji.

    • 203C (‼️) double exclamation mark
    • 2049 (⁉️) exclamation question mark
  • I updated everything to Unicode 15 and applied the latest changes.

  • Added code for deriving the spec from Unicode files and ENS-specific rules (example).

  • Added code for generating validation tests from custom examples, generated from derive rules, random names, and registered names.
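
For the point above about characters that decompose into 2+ adjacent combining marks, a rough way to spot them is sketched below. It assumes NFD from the JavaScript runtime and the Unicode `\p{M}` (Mark) property as the definition of a combining mark, which may not match the ENSIP's CM set exactly; the function name is made up.

```javascript
// Hypothetical helper: true if the canonical decomposition (NFD) of the input
// contains two or more combining marks in a row.
// Assumption: \p{M} approximates the combining mark set used by the spec.
function hasAdjacentCombiningMarks(ch) {
  return /\p{M}{2,}/u.test(ch.normalize("NFD"));
}

// 1D6 (u with diaeresis and macron) decomposes to [75 308 304]: two marks in a row.
console.log(hasAdjacentCombiningMarks("\u01D6")); // true
// E9 (e with acute) decomposes to [65 301]: a single mark.
console.log(hasAdjacentCombiningMarks("\u00E9")); // false
```

The result depends on the Unicode version of the runtime, which is the same caveat as the Node NF discussion earlier in the thread.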


Edit: Delta Report: ens_normalize (1.6.3) vs ens_normalize (1.6.4) [1484938 labels] @ 2022-09-19T05:21:53.105Z
