Yes.
After some thought, the presence of an emoji tag sequence should simply terminate the emoji parsing, rather than consuming the tag and ignoring it. The tag sequence would then be processed (and rejected) by IDNA 2008. Ignoring it is bad because you need to differentiate Flag from Flag+TagSequence, since Flag could combine with something else.
There are only (3) RGI tag sequences in the Unicode set (each following a black flag emoji.) Maybe they should be whitelisted, as 🏴.eth
is currently owned (but vulnerable to spoofing on non-supporting platforms, where the tag sequence renders invisibly).
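To make the spoofing concern concrete, here is a minimal Python sketch of how the three RGI tag sequences are built and why dropping the tags is lossy. The helper names (`tag_sequence`, `strip_tags`, `RGI_TAG_SEQUENCES`) are made up for illustration and are not from my library; the code points are the standard ones (black flag U+1F3F4, tag characters at ASCII + 0xE0000, cancel tag U+E007F).

```python
BLACK_FLAG = 0x1F3F4   # base emoji that every RGI tag sequence starts with
CANCEL_TAG = 0xE007F   # terminator of a tag sequence

def tag_sequence(region: str) -> str:
    """Build black flag + tag characters + cancel tag for a region code."""
    # Tag characters mirror ASCII at an offset of 0xE0000.
    tags = "".join(chr(0xE0000 + ord(c)) for c in region)
    return chr(BLACK_FLAG) + tags + chr(CANCEL_TAG)

# The three RGI tag sequences: England, Scotland, Wales.
RGI_TAG_SEQUENCES = {
    "England": tag_sequence("gbeng"),
    "Scotland": tag_sequence("gbsct"),
    "Wales": tag_sequence("gbwls"),
}

def strip_tags(s: str) -> str:
    """Simulate 'consume the tag and ignore it' by dropping all tag characters."""
    return "".join(c for c in s if not 0xE0000 <= ord(c) <= 0xE007F)

if __name__ == "__main__":
    for name, seq in RGI_TAG_SEQUENCES.items():
        print(name, [f"{ord(c):X}" for c in seq])
    # All three collapse onto the bare black flag once tags are ignored.
    print({strip_tags(s) for s in RGI_TAG_SEQUENCES.values()} == {chr(BLACK_FLAG)})
```

If tags were silently ignored, all three flags would become indistinguishable from the bare black flag, which is the collision described above.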
So the whitelist logic would be:
SEQ = list of allowed complete sequences (3 RGI)
ZWJ = list of allowed ZWJ sequences (1349 RGI + 0-75 non-RGI)
1. Find the longest SEQ that exactly matches the characters.
2. If it exists, produce an emoji token and goto 1 (this handles the flag + tag sequences).
3. Parse the characters according to UTS-51, where ZWJs can join only if they form a whitelisted sequence.
4. If an emoji was found, produce an emoji token and goto 1.
5. Parse the character according to UTS-46 and goto 1.
With this logic, an unsupported ZWJ sequence will terminate before a ZWJ, which will then go through UTS-46 and get rejected by ContextJ. An unsupported SEQ sequence is just parsed normally by UTS-51 and UTS-46 (and likely rejected.)
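A rough Python sketch of that loop, to show the control flow only. This is not the code in my library: `tokenize`, `parse_emoji`, `TOY_EMOJI`, and the tiny whitelists are hypothetical, the UTS-51 step is reduced to "one emoji character, optional FE0F, then ZWJ-joined elements", and the UTS-46 step just emits a text token for later validation.

```python
ZWJ = "\u200D"
FE0F = "\uFE0F"

# Toy data (assumptions for illustration): man, woman, girl, black flag.
TOY_EMOJI = {"\U0001F468", "\U0001F469", "\U0001F467", "\U0001F3F4"}
ENGLAND = "\U0001F3F4\U000E0067\U000E0062\U000E0065\U000E006E\U000E0067\U000E007F"
SCOTLAND = "\U0001F3F4\U000E0067\U000E0062\U000E0073\U000E0063\U000E0074\U000E007F"
FAMILY = "\U0001F468\u200D\U0001F469\u200D\U0001F467"  # man ZWJ woman ZWJ girl

def parse_emoji(text, start, zwj_whitelist, is_emoji):
    """Simplified stand-in for UTS-51 parsing; returns the end index (start if no emoji)."""
    def element_end(pos):
        # One emoji character plus an optional variation selector.
        if pos < len(text) and is_emoji(text[pos]):
            pos += 1
            if pos < len(text) and text[pos] == FE0F:
                pos += 1
            return pos
        return -1

    end = element_end(start)
    if end < 0:
        return start
    best, pos = end, end
    # Extend across ZWJs, but only commit when the joined run is whitelisted.
    while pos < len(text) and text[pos] == ZWJ:
        nxt = element_end(pos + 1)
        if nxt < 0:
            break
        pos = nxt
        if text[start:pos] in zwj_whitelist:
            best = pos
    return best

def tokenize(text, seq_whitelist, zwj_whitelist, is_emoji):
    tokens, i = [], 0
    while i < len(text):
        # Steps 1-2: longest whitelisted complete sequence (flag + tag sequence).
        match = max((s for s in seq_whitelist if text.startswith(s, i)),
                    key=len, default=None)
        if match:
            tokens.append(("emoji", match))
            i += len(match)
            continue
        # Steps 3-4: emoji parse with whitelisted ZWJ joins.
        end = parse_emoji(text, i, zwj_whitelist, is_emoji)
        if end > i:
            tokens.append(("emoji", text[i:end]))
            i = end
            continue
        # Step 5: hand the character to UTS-46 (here: just a text token).
        tokens.append(("text", text[i]))
        i += 1
    return tokens

if __name__ == "__main__":
    is_emoji = TOY_EMOJI.__contains__
    for name in (FAMILY + "a", "\U0001F468\u200D\U0001F467", ENGLAND, SCOTLAND):
        print(tokenize(name, {ENGLAND}, {FAMILY}, is_emoji))
```

With these toy whitelists, the non-whitelisted man ZWJ girl run splits into emoji, a bare ZWJ text token (which ContextJ would reject), and emoji. The non-whitelisted Scotland flag becomes a bare black flag emoji followed by tag characters as text tokens, which the UTS-46 step would then reject.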
I updated my library to support this logic. I also whitelisted the 3 RGI tag sequences and added some non-RGI ZWJ sequences as a test.