/ [2044] is on my “I’m not sure” list. It’s used in IDNA 2003 (and 2008) mapping for fractions (like ½). So if we disallow 2044, we also need to remove 20 mappings.
As of recently, there are about 250 names with 2044.
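To make the fraction point concrete, here is a rough sketch that lists which fraction characters produce 2044 when compatibility-normalized. It uses NFKC as a stand-in for the IDNA mapping table (an assumption; the real table has more rules), and the candidate range (Latin-1 fractions plus the Number Forms block) is my own choice for illustration.

```javascript
// Candidate fraction characters: the three Latin-1 fractions plus the
// Number Forms block (2150..218F). This range is an assumption for the sketch.
const candidates = [0xBC, 0xBD, 0xBE];
for (let cp = 0x2150; cp <= 0x218F; cp++) candidates.push(cp);

const FRACTION_SLASH = "\u2044";

// Keep only the characters whose NFKC form contains the fraction slash.
const affected = candidates
  .map((cp) => String.fromCodePoint(cp))
  .filter((ch) => ch.normalize("NFKC").includes(FRACTION_SLASH));

// Print each affected character with the codepoints it maps to.
for (const ch of affected) {
  const mapped = [...ch.normalize("NFKC")]
    .map((c) => c.codePointAt(0).toString(16).toUpperCase())
    .join(" ");
  console.log(`${ch} -> [${mapped}]`);
}
console.log(`${affected.length} characters depend on 2044`);
```

The exact count depends on the Unicode version of the runtime, so treat the printed number as something to compare against the mapping table rather than as a fixed value.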
@raffy trivial:
import {ens_beautify, ens_normalize} from "@adraffy/ens_normalize";
will not resolve ens_beautify, but:
import {ens_beautify, ens_normalize} from "@adraffy/ens_normalize/src/lib";
will. Tested on Angular 2.0 v13.4
I incorporated these changes into the ENSIP and updated the corresponding libraries.
Damage report: ens_normalize (1.6.0) vs eth-ens-namehash (2.0.15) [1463921 labels] @ 2022-09-01T09:50:38.018Z
There’s some Hebrew, Devanagari, Thai names that are collateral damage from the CM change. It’s unclear if these are legitimate names or if they’re just decorated with superfluous CM.
w/r/t Unicode NF support: Node 14 with Unicode 13 now runs the validation tests successfully. Node 12 with Unicode 11 has 1 error (on a decomposable codepoint that was added in Unicode 13.)
To implement the Emoji + CM rule, the emoji.json now includes single-codepoint default text-presentation emoji (they used to be in chars.json and were considered characters.) Ultimately, I think this is an improvement, because they are emoji, but it potentially breaks the assumption that all emoji are safe; instead, now all emoji of 2+ codepoints are safe.
I also removed the following CMs which appear to be underscore-like:
cm-tally.json
320 (x̠) Combining Minus Sign Below
332 (x̲) Combining Low Line
333 (x̳) Combining Double Low Line
347 (x͇) Combining Equals Sign Below
FE2B (x︫) Combining Macron Left Half Below
FE2C (x︬) Combining Macron Right Half Below
FE2D (x︭) Combining Conjoining Macron Below
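As an illustration of the list above, a small sketch that flags those underscore-like combining marks in a label. The function name is hypothetical and this is not how the library stores its tables; it only checks the 7 codepoints listed.

```javascript
// The 7 underscore-like combining marks listed above.
const UNDERSCORE_LIKE_CM = new Set([
  0x320, 0x332, 0x333, 0x347, 0xFE2B, 0xFE2C, 0xFE2D,
]);

// Hypothetical helper: returns the offending codepoints found in a label.
// NFD is applied first so precomposed forms are split into base + marks.
function findUnderscoreLikeMarks(label) {
  return [...label.normalize("NFD")]
    .map((ch) => ch.codePointAt(0))
    .filter((cp) => UNDERSCORE_LIKE_CM.has(cp));
}

console.log(findUnderscoreLikeMarks("x\u0332y")); // x with Combining Low Line
```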
Also, josiahadams brought up an issue about substring matching, which requires partial normalization. I added some comments to the ENSIP about fragments and included an example which uses ens_normalize_fragment(), which is nearly equivalent to the Processing step in the ENSIP. For example, this would let you prepare the fragments [303 39 FE0F 20E3] and "aa--" (which both fail full normalization.)
ENS Rego Bot isn’t picking up on this and is just showing 0.eth
Had to go through ENSvisions recent regos to see how it’s being done
Another one:
I wasn’t specifically aware of Horizontal Scan Line-#, but I’m aware there are many symbols that require opinionated review.
I agree ⎺ ⎻ ⎼ ⎽ ⏤ ⎯ [23BA 23BB 23BC 23BD 23E4 23AF] need to be mapped to hyphen or disallowed. They should have been part of the hyphen mapping. Edit: I say we either (1) disallow all 6, (2) disallow the first 4 and map the last 2 to hyphen, or (3) map all to hyphen.
The policy should probably be: there’s one underscore, there’s many things that map to hyphen, or it’s disallowed.
IMO, the following nearby codepoints could be disallowed too:
⎀⎁⎂⎃ – Latin-like
⌿⍀⍳⍸⍴⍷⍵⍹⍺⍶ – APL Letter Like
⍡⍢⍣⍤⍥⍨⍩ – APL + Double Dot
⍫⍬⍭ – APL + Tilde
⍪⍮⍘⍙⍚⍛⍜ – APL + Hyphen/Underscore
⎜⎟⎢⎥⎨⎪⎬⎮⎸⎹⍿ – Vertical Bar
aa⏜aa⏝aa⏞aa⏟aa⏠aa⏡aa – Rotated Brackets (ignore the a’s)
I mapped all 6 of those characters to hyphen (option 3.) While it’s best to disallow characters so they can be revisited later (mapping should be used sparingly), I see no future for alternative hyphens and mapping is much better than the alternative.
I left the others unchanged. They’re low-frequency at the moment: JSON
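For comparison, the three options discussed for the 6 hyphen-like characters can be sketched as a small policy table. The names (FIRST_FOUR, LAST_TWO, POLICIES, applyPolicy) are hypothetical and this is only an illustration of the choice, not the library’s code.

```javascript
// The 6 hyphen-like codepoints from the discussion, split as in option (2).
const FIRST_FOUR = [0x23BA, 0x23BB, 0x23BC, 0x23BD];
const LAST_TWO = [0x23E4, 0x23AF];

// Option number -> which codepoints map to hyphen and which are disallowed.
const POLICIES = {
  1: { map: [], disallow: [...FIRST_FOUR, ...LAST_TWO] },
  2: { map: LAST_TWO, disallow: FIRST_FOUR },
  3: { map: [...FIRST_FOUR, ...LAST_TWO], disallow: [] },
};

// Hypothetical helper: apply one option to a label.
function applyPolicy(label, option) {
  const { map, disallow } = POLICIES[option];
  let out = "";
  for (const ch of label) {
    const cp = ch.codePointAt(0);
    if (disallow.includes(cp)) {
      throw new Error(`disallowed: ${cp.toString(16).toUpperCase()}`);
    }
    out += map.includes(cp) ? "-" : ch;
  }
  return out;
}

console.log(applyPolicy("a\u23AFb", 3)); // option 3 maps every one of the 6 to "-"
```

Under option 3 nothing throws for these characters, which is the behavior described in the post above.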
The latest version of ens_normalize.js includes NF 14.0.0, ending any Unicode issues. I was able to run 100% of the validation tests (unmodified) on Node 11 (which is on Unicode 11.) I also removed the globalThis.atob() dependency (h/t makoto).
There’s now an ens_emoji() function which returns all of the emoji sequences.
I was incorrect here. There are some confusable emoji that eventually need to be reviewed to decide whether some kind of warning is appropriate:
🚴♂🚴🏻🚴 Gender:[1F6B4 200D 2642] vs Skin:[1F6B4 1F3FB] vs Singular:[1F6B4]
🇺🇸 🇺🇲 US:[1F1FA 1F1F8] vs UM:[1F1FA 1F1F2]
For the flag cases, the beautifier now inserts 200B between regional indicators, which prevents them from collapsing into flags. This makes US vs UM obvious and works for all similar flag-related issues.
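A minimal sketch of the 200B idea, assuming regional indicators are the range 1F1E6..1F1FF. The function name is hypothetical and the actual beautifier may do this differently.

```javascript
const ZWSP = "\u200B";

// Regional indicator symbols A..Z occupy 1F1E6..1F1FF.
const isRegionalIndicator = (cp) => cp >= 0x1F1E6 && cp <= 0x1F1FF;

// Hypothetical helper: insert 200B between adjacent regional indicators
// so that renderers cannot pair them into a flag glyph.
function separateRegionalIndicators(s) {
  let out = "";
  let prevWasRI = false;
  for (const ch of s) {
    const isRI = isRegionalIndicator(ch.codePointAt(0));
    if (isRI && prevWasRI) out += ZWSP;
    out += ch;
    prevWasRI = isRI;
  }
  return out;
}

// US flag [1F1FA 1F1F8] becomes [1F1FA 200B 1F1F8].
const separated = separateRegionalIndicators("\u{1F1FA}\u{1F1F8}");
console.log([...separated].map((c) => c.codePointAt(0).toString(16).toUpperCase()).join(" "));
```

Since 200B is in the ignored set, the separated string should still normalize to the same name; it only changes how the label is displayed.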
Delta Report: ens_normalize (1.6.1) vs ens_normalize (1.6.3) [1473529 labels] @ 2022-09-09T08:57:10.840Z
U+0027 disallowed but U+2018 & U+2019 allowed
These are going to be confusable with each other, let alone ‘
Thanks @Theth.eth. Agreed, a bunch of that punctuation block should be disallowed.
Disallowed
2000 ( ) EN QUAD
2001 ( ) EM QUAD
2002 ( ) EN SPACE
2003 ( ) EM SPACE
2004 ( ) THREE-PER-EM SPACE
2005 ( ) FOUR-PER-EM SPACE
2006 ( ) SIX-PER-EM SPACE
2007 ( ) FIGURE SPACE
2008 ( ) PUNCTUATION SPACE
2009 ( ) THIN SPACE
200A ( ) HAIR SPACE
200C () ZERO WIDTH NON-JOINER
200D () ZERO WIDTH JOINER
200E () LEFT-TO-RIGHT MARK
200F () RIGHT-TO-LEFT MARK
2017 (‗) DOUBLE LOW LINE
2024 (․) ONE DOT LEADER
2025 (‥) TWO DOT LEADER
2026 (…) HORIZONTAL ELLIPSIS
2028 ( ) LINE SEPARATOR
2029 ( ) PARAGRAPH SEPARATOR
202A (<U+202A>) LEFT-TO-RIGHT EMBEDDING
202B (<U+202B>) RIGHT-TO-LEFT EMBEDDING
202C (<U+202C>) POP DIRECTIONAL FORMATTING
202D (<U+202D>) LEFT-TO-RIGHT OVERRIDE
202E (<U+202E>) RIGHT-TO-LEFT OVERRIDE
202F ( ) NARROW NO-BREAK SPACE
203C (‼) DOUBLE EXCLAMATION MARK
203E (‾) OVERLINE
2047 (⁇) DOUBLE QUESTION MARK
2048 (⁈) QUESTION EXCLAMATION MARK
2049 (⁉) EXCLAMATION QUESTION MARK
205F ( ) MEDIUM MATHEMATICAL SPACE
2061 () FUNCTION APPLICATION
2062 () INVISIBLE TIMES
2063 () INVISIBLE SEPARATOR
2065 () undefined
2066 (<U+2066>) LEFT-TO-RIGHT ISOLATE
2067 (<U+2067>) RIGHT-TO-LEFT ISOLATE
2068 (<U+2068>) FIRST STRONG ISOLATE
2069 (<U+2069>) POP DIRECTIONAL ISOLATE
206A () INHIBIT SYMMETRIC SWAPPING
206B () ACTIVATE SYMMETRIC SWAPPING
206C () INHIBIT ARABIC FORM SHAPING
206D () ACTIVATE ARABIC FORM SHAPING
206E () NATIONAL DIGIT SHAPES
206F () NOMINAL DIGIT SHAPES
Ignored
200B () ZERO WIDTH SPACE
2060 () WORD JOINER
2064 () INVISIBLE PLUS
Mapped
2010 (‐) HYPHEN
→ 2D (-) HYPHEN-MINUS
2011 (‑) NON-BREAKING HYPHEN
→ 2D (-) HYPHEN-MINUS
2012 (‒) FIGURE DASH
→ 2D (-) HYPHEN-MINUS
2013 (–) EN DASH
→ 2D (-) HYPHEN-MINUS
2014 (—) EM DASH
→ 2D (-) HYPHEN-MINUS
2015 (―) HORIZONTAL BAR
→ 2D (-) HYPHEN-MINUS
2033 (″) DOUBLE PRIME
→ [2032 2032]
2034 (‴) TRIPLE PRIME
→ [2032 2032 2032]
2036 (‶) REVERSED DOUBLE PRIME
→ [2035 2035]
2037 (‷) REVERSED TRIPLE PRIME
→ [2035 2035 2035]
2057 (⁗) QUADRUPLE PRIME
→ [2032 2032 2032 2032]
Valid
2016 (‖) DOUBLE VERTICAL LINE
2018 (‘) LEFT SINGLE QUOTATION MARK
2019 (’) RIGHT SINGLE QUOTATION MARK
201A (‚) SINGLE LOW-9 QUOTATION MARK
201B (‛) SINGLE HIGH-REVERSED-9 QUOTATION MARK
201C (“) LEFT DOUBLE QUOTATION MARK
201D (”) RIGHT DOUBLE QUOTATION MARK
201E („) DOUBLE LOW-9 QUOTATION MARK
201F (‟) DOUBLE HIGH-REVERSED-9 QUOTATION MARK
2020 (†) DAGGER
2021 (‡) DOUBLE DAGGER
2022 (•) BULLET
2023 (‣) TRIANGULAR BULLET
2027 (‧) HYPHENATION POINT
2030 (‰) PER MILLE SIGN
2031 (‱) PER TEN THOUSAND SIGN
2032 (′) PRIME
2035 (‵) REVERSED PRIME
2038 (‸) CARET
2039 (‹) SINGLE LEFT-POINTING ANGLE QUOTATION MARK
203A (›) SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
203B (※) REFERENCE MARK
203D (‽) INTERROBANG
203F (‿) UNDERTIE
2040 (⁀) CHARACTER TIE
2041 (⁁) CARET INSERTION POINT
2042 (⁂) ASTERISM
2043 (⁃) HYPHEN BULLET
→ 2D (-) HYPHEN-MINUS
2044 (⁄) FRACTION SLASH
2045 (⁅) LEFT SQUARE BRACKET WITH QUILL
2046 (⁆) RIGHT SQUARE BRACKET WITH QUILL
204A (⁊) TIRONIAN SIGN ET
204B (⁋) REVERSED PILCROW SIGN
204C (⁌) BLACK LEFTWARDS BULLET
204D (⁍) BLACK RIGHTWARDS BULLET
204E (⁎) LOW ASTERISK
204F (⁏) REVERSED SEMICOLON
2050 (⁐) CLOSE UP
2051 (⁑) TWO ASTERISKS ALIGNED VERTICALLY
2052 (⁒) COMMERCIAL MINUS SIGN
2053 (⁓) SWUNG DASH
2054 (⁔) INVERTED UNDERTIE
2055 (⁕) FLOWER PUNCTUATION MARK
2056 (⁖) THREE DOT PUNCTUATION
2058 (⁘) FOUR DOT PUNCTUATION
2059 (⁙) FIVE DOT PUNCTUATION
205A (⁚) TWO DOT PUNCTUATION
205B (⁛) FOUR DOT MARK
205C (⁜) DOTTED CROSS
205D (⁝) TRICOLON
205E (⁞) VERTICAL FOUR DOTS
⁂⁜※ seem cool, I’m not sure about •, disallow the rest? Frequencies: JSON
Edit: If we disallow ‿ and ⁔ we lose a lot of faces like ◕‿◕.
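To show how the four categories above interact, here is a rough sketch of a per-codepoint pass using a tiny excerpt of the lists. The table names and processCodepoints are hypothetical, anything not in the excerpt is passed through as valid (an assumption made only to keep the sketch short), and emoji handling is left out entirely.

```javascript
// Small excerpt of the lists above; not the full tables.
const IGNORED = new Set([0x200B, 0x2060, 0x2064]);
const DISALLOWED = new Set([0x2000, 0x2026, 0x203E]);
const MAPPED = new Map([
  [0x2010, [0x2D]],
  [0x2013, [0x2D]],
  [0x2014, [0x2D]],
  [0x2033, [0x2032, 0x2032]],
  [0x2057, [0x2032, 0x2032, 0x2032, 0x2032]],
]);

// Hypothetical helper: drop ignored, replace mapped, reject disallowed.
function processCodepoints(input) {
  const out = [];
  for (const ch of input) {
    const cp = ch.codePointAt(0);
    if (IGNORED.has(cp)) continue;
    if (DISALLOWED.has(cp)) {
      throw new Error(`disallowed codepoint: ${cp.toString(16).toUpperCase()}`);
    }
    const mapped = MAPPED.get(cp);
    if (mapped) out.push(...mapped);
    else out.push(cp); // assumption: everything else is treated as valid here
  }
  return String.fromCodePoint(...out);
}

console.log(processCodepoints("a\u2010b\u200Bc")); // a-bc
```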
• would be a confusable for various braille letters
there is also this which could be a confusable for braille:
Would be good if there was a way to ban everything, but then select the ones you want, instead of the other way round
I tried this but you run into the same issues. Many symbols just need to be manually reviewed.
I updated my choices above and included a report. Most of the damage is names with apostrophe-like characters or leading bullets.
Do you mean for the fraction slash not to be in bold?
Would that not be a confusable with the forward slash on a keyboard, which isn’t allowed?
Unicode 15 officially released. I don’t know if we want to include emojis now or wait until there’s platform support (I say now that they’re official.) I’ll also check what other differences show up when I switch to the latest Unicode data files.
Platform support takes years, Windows still doesn’t support Unicode 14. I say just include them now so we don’t have to release a new version of our libraries and then get ethers/metamask and everyone else to update again etc.
2019 (’) RIGHT SINGLE QUOTATION MARK should be valid. It allows names and surnames to be used on the Ethereum Name Service (O’Brien, O’Donnell, Kevin O’Leary, etc.)
Okay, should we enforce:
2019
? eg. ’’’.eth
❤️’.eth
Since ' [27] has always been disallowed, we could map it to 2019 for UX.
2019
Mapping ' [27] to 2019 is perfect.
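The proposed 27 to 2019 mapping amounts to a single substitution. A tiny sketch, with a hypothetical function name, just to show the effect on a typed name:

```javascript
// Hypothetical helper: map ASCII apostrophe [27] to RIGHT SINGLE QUOTATION MARK [2019].
function mapApostrophe(label) {
  return label.replace(/'/g, "\u2019");
}

console.log(mapApostrophe("o'brien")); // the typed ' becomes ’
```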
Where can we see the 250 names with 2044?
Bunch of random updates:
There are Modifier_Base emoji that don’t have Modifier_Base + Modifier RGI-equivalents. I had mistakenly assumed that every combination was RGI. I assume we still want to include them? (via the whitelist)
// missing MOD_BASE + MODIFIER combinations
// (👪) FAMILY
'1F46A 1F3FB',
'1F46A 1F3FC',
'1F46A 1F3FD',
'1F46A 1F3FE',
'1F46A 1F3FF',
// (👯) WOMAN WITH BUNNY EARS
'1F46F 1F3FB',
'1F46F 1F3FC',
'1F46F 1F3FD',
'1F46F 1F3FE',
'1F46F 1F3FF',
// (🤼) WRESTLERS
'1F93C 1F3FB',
'1F93C 1F3FC',
'1F93C 1F3FD',
'1F93C 1F3FE',
'1F93C 1F3FF',
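The 15 whitelist entries above are just 3 bases crossed with the 5 skin-tone modifiers, so they could be generated instead of written out. A short sketch, with hypothetical variable names:

```javascript
// The 3 Modifier_Base emoji from the list above and the 5 skin-tone modifiers.
const MOD_BASES = [0x1F46A, 0x1F46F, 0x1F93C];
const MODIFIERS = [0x1F3FB, 0x1F3FC, 0x1F3FD, 0x1F3FE, 0x1F3FF];

const hex = (cp) => cp.toString(16).toUpperCase();

// Produces strings in the same "BASE MODIFIER" hex format used in the list.
const whitelist = MOD_BASES.flatMap((base) =>
  MODIFIERS.map((mod) => `${hex(base)} ${hex(mod)}`)
);

console.log(whitelist);
```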
During some testing, I discovered that there are 166 characters that when decomposed (either valid or mapped) become 2+ adjacent combining marks. I disallowed all of them since our combining mark rule eliminates them anyway.
They have very minimal use: JSON
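A quick way to spot such characters is to decompose and look for a run of marks. This sketch assumes “combining mark” means Unicode general category M (matched with \p{M}), which may not be exactly the set the ENSIP uses; the function name is hypothetical.

```javascript
// Hypothetical helper: true if the NFD form contains 2+ adjacent marks.
// Assumption: "combining mark" is approximated by general category M.
function hasAdjacentMarks(s) {
  return /\p{M}{2,}/u.test(s.normalize("NFD"));
}

// 1D6 (u with diaeresis and macron) decomposes to [75 308 304].
console.log(hasAdjacentMarks("\u01D6"));
// E9 (e with acute) decomposes to [65 301], a single mark.
console.log(hasAdjacentMarks("\u00E9"));
```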
For the punctuation discussion above:
Mapped 2027 (‧) HYPHENATION POINT to hyphen instead of disallowing it.
2022 (•) BULLET is valid.
At the moment, only 2 emoji are disallowed. They are default text-presentation so they format unstyled like !! and !? (but with less kerning). For reference, both ? and ! by themselves are invalid. Nothing prevents them from being valid with how we currently handle emoji.
203C (‼️) double exclamation mark
2049 (⁉️) exclamation question mark
I updated everything to Unicode 15 and applied the latest changes.
Added code for deriving the spec from Unicode files and ENS-specific rules (example).
Added code for generating validation tests from custom examples, generated from derive rules, random names, and registered names.
Edit: Delta Report: ens_normalize (1.6.3) vs ens_normalize (1.6.4) [1484938 labels] @ 2022-09-19T05:21:53.105Z