There are some Hebrew, Devanagari, and Thai names that are collateral damage from the CM change. It’s unclear if these are legitimate names or if they’re just decorated with superfluous CMs.
w/r/t Unicode NF support: Node 14 with Unicode 13 now runs the validation tests successfully. Node 12 with Unicode 11 has 1 error (on a decomposable codepoint that was added in Unicode 13).
To implement the Emoji + CM rule, the emoji.json now includes single-codepoint default text-presentation emoji (they used to be in chars.json and were considered characters.) Ultimately, I think this is an improvement, because they are emoji, but it potentially breaks the assumption that all emoji are safe – instead, now all emoji of 2+ codepoints are safe.
I also removed the following CMs which appear to be underscore-like: cm-tally.json
320 (x̠) Combining Minus Sign Below
332 (x̲) Combining Low Line
333 (x̳) Combining Double Low Line
347 (x͇) Combining Equals Sign Below
FE2B (x︫) Combining Macron Left Half Below
FE2C (x︬) Combining Macron Right Half Below
FE2D (x︭) Combining Conjoining Macron Below
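A minimal sketch of how the list above could be checked against a name. The set contents come from the list; the names UNDERSCORE_LIKE_CM and findUnderscoreLikeCM are hypothetical and are not part of ens_normalize.js.

```javascript
// Hypothetical helper, not the library's API.
// Codepoints are the underscore-like combining marks listed above.
const UNDERSCORE_LIKE_CM = new Set([
  0x320, 0x332, 0x333, 0x347, 0xFE2B, 0xFE2C, 0xFE2D,
]);

// Return the underscore-like combining marks found in a string,
// as uppercase hex strings, in order of appearance.
function findUnderscoreLikeCM(name) {
  const found = [];
  for (const ch of name) { // for...of iterates by codepoint
    const cp = ch.codePointAt(0);
    if (UNDERSCORE_LIKE_CM.has(cp)) {
      found.push(cp.toString(16).toUpperCase());
    }
  }
  return found;
}

console.log(findUnderscoreLikeCM('x\u0332y')); // [ '332' ]
console.log(findUnderscoreLikeCM('abc'));      // []
```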
Also, josiahadams brought up an issue about substring matching, which requires partial normalization. I added some comments to the ENSIP about fragments and included an example that uses ens_normalize_fragment(), which is nearly equivalent to the Processing step in the ENSIP. For example, this would let you prepare the fragments [303 39 FE0F 20E3] and "aa--" (which both fail full normalization).
I wasn’t aware specifically of Horizontal Scan Line-# but I’m aware there are many symbols that require opinionated review.
I agree ⎺ ⎻ ⎼ ⎽ ⏤ ⎯ [23BA 23BB 23BC 23BD 23E4 23AF] need to be mapped to hyphen or disallowed. They should have been part of the hyphen mapping. Edit: I say we either (1) disallow all 6, (2) disallow the first 4 and map the last 2 to hyphen, or (3) map all to hyphen.
The policy should probably be: a character is either the one underscore, one of the many things that map to hyphen, or disallowed.
IMO, the following nearby codepoints could be disallowed too:
⎀⎁⎂⎃ — Latin-like
⌿⍀⍳⍸⍴⍷⍵⍹⍺⍶ — APL Letter Like
⍡⍢⍣⍤⍥⍨⍩ — APL + Double Dot
⍫⍬⍭ — APL + Tilde
⍪⍮⍘⍙⍚⍛⍜ — APL + Hyphen/Underscore
⎜⎟⎢⎥⎨⎪⎬⎮⎸⎹⍿ — Vertical Bar
aa⏜aa⏝aa⏞aa⏟aa⏠aa⏡aa – Rotated Brackets (ignore the a’s)
I mapped all 6 of those characters to hyphen (option 3.) While it’s best to disallow characters so they can be revisited later (mapping should be used sparingly), I see no future for alternative hyphens and mapping is much better than the alternative.
I left the others unchanged. They have low usage at the moment: JSON
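The hyphen mapping (option 3) can be sketched roughly like this. It only illustrates the idea of mapping those 6 codepoints to "-"; HYPHEN_LIKE and mapHyphenLike are hypothetical names, and the real library derives its mappings from its own data files.

```javascript
// Hypothetical sketch, not the library's implementation.
// The 6 horizontal-line characters discussed above.
const HYPHEN_LIKE = [0x23BA, 0x23BB, 0x23BC, 0x23BD, 0x23E4, 0x23AF];
const HYPHEN_MAP = new Map(HYPHEN_LIKE.map((cp) => [cp, '-']));

// Replace every hyphen-like codepoint with an ASCII hyphen;
// all other codepoints pass through unchanged.
function mapHyphenLike(label) {
  let out = '';
  for (const ch of label) {
    const cp = ch.codePointAt(0);
    out += HYPHEN_MAP.has(cp) ? HYPHEN_MAP.get(cp) : ch;
  }
  return out;
}

console.log(mapHyphenLike('a\u23AFb')); // a-b
```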
The latest version of ens_normalize.js includes NF 14.0.0, ending any Unicode issues. I was able to run 100% of the validation tests (unmodified) on Node 11 (which is on Unicode 11). I also removed the globalThis.atob() dependency (h/t makoto).
There’s now an ens_emoji() function which returns all of the emoji sequences.
I was incorrect here. There are some confusable emoji that eventually need to be reviewed regarding if some kind of warning is appropriate:
🚴♂🚴🏻🚴 Gender:[1F6B4 200D 2642] vs Skin:[1F6B4 1F3FB] vs Singular:[1F6B4]
🇺🇸 🇺🇲 US:[1F1FA 1F1F8] vs UM:[1F1FA 1F1F2]
For the flag cases, the beautifier now inserts 200B between regional indicators, which prevents them collapsing into flags. This makes US vs UM obvious and works for all similar flag-related issues.
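Roughly, the flag handling in the beautifier works like the sketch below. This is a simplified illustration, assuming regional indicators are the range 1F1E6..1F1FF; separateRegionalIndicators is a hypothetical name, not the function in the library.

```javascript
// Hypothetical sketch of the idea, not the library's beautifier.
const ZWSP = '\u200B'; // zero width space

// Regional indicator symbols A..Z occupy U+1F1E6..U+1F1FF.
function isRegionalIndicator(cp) {
  return cp >= 0x1F1E6 && cp <= 0x1F1FF;
}

// Insert 200B between any two adjacent regional indicators so that
// renderers no longer fuse the pair into a flag glyph.
function separateRegionalIndicators(s) {
  let out = '';
  let prevWasRI = false;
  for (const ch of s) {
    const ri = isRegionalIndicator(ch.codePointAt(0));
    if (ri && prevWasRI) out += ZWSP;
    out += ch;
    prevWasRI = ri;
  }
  return out;
}

// US flag [1F1FA 1F1F8] becomes [1F1FA 200B 1F1F8]
const us = separateRegionalIndicators('\u{1F1FA}\u{1F1F8}');
console.log([...us].map((c) => c.codePointAt(0).toString(16)).join(' '));
// 1f1fa 200b 1f1f8
```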
Unicode 15 has been officially released. I don’t know if we want to include the new emoji now or wait until there’s platform support (I say now, since they’re official). I’ll also check what other differences show up when I switch to the latest Unicode data files.
Platform support takes years; Windows still doesn’t support Unicode 14. I say just include them now so we don’t have to release a new version of our libraries and then get ethers/metamask and everyone else to update again.
During the cleanup of code that derives the spec from the Unicode files and the rules we’ve been developing, I discovered that there were 3 Modifier_Base emoji that don’t have a Modifier_Base + Modifier RGI equivalent. I had mistakenly assumed that every combination was RGI. I assume we still want to include them? (via the whitelist)
During some testing, I discovered that there are 166 characters that when decomposed (either valid or mapped) become 2+ adjacent combining marks. I disallowed all of them since our combining mark rule eliminates them anyway.
They have very minimal use: JSON
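For anyone who wants to reproduce something like this check, here is a rough sketch. It makes two assumptions that may not match what I actually ran: it uses the platform's NFD via String.prototype.normalize, and it treats General_Category M (\p{M}) as the combining mark set, whereas the spec uses its own CM list. hasAdjacentMarksAfterNFD is a hypothetical name.

```javascript
// Hypothetical sketch: approximates "CM" with \p{M} and uses the
// runtime's own NFD tables, so results depend on the Node/ICU version.
const MARK = /^\p{M}$/u;

// True if the NFD form of the input contains 2+ adjacent combining marks.
function hasAdjacentMarksAfterNFD(s) {
  const cps = [...s.normalize('NFD')]; // split by codepoint
  for (let i = 1; i < cps.length; i++) {
    if (MARK.test(cps[i - 1]) && MARK.test(cps[i])) return true;
  }
  return false;
}

console.log(hasAdjacentMarksAfterNFD('\u1EA5')); // true:  61 302 301
console.log(hasAdjacentMarksAfterNFD('\u00E9')); // false: 65 301
```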
For the punctuation discussion above:
I mapped 2027 (‧) HYPHENATION POINT to hyphen instead of disallowing it.
After looking at prior registrations, I kept 2022 (•) BULLET valid.
At the moment, only 2 emoji are disallowed. They are default text-presentation so they format unstyled like !! and !? (but with less kerning). For reference, both ? and ! by themselves are invalid. Nothing prevents them from being valid with how we currently handle emoji.
203C (‼️) double exclamation mark
2049 (⁉️) exclamation question mark
I updated everything to Unicode 15 and applied the latest changes.