Here are some thoughts:
Assume the max label length fits in 16 bits (or whatever). The low-level normalize should be string → (string norm, string normNoEmoji, uint256[] labelData) where:
- normNoEmoji is the same string but with each emoji sequence zeroed, e.g. a 4-byte emoji → [0,0,0,0]
- each labelData is:
  - 16 bits → label length
  - 240 bits → bitset of active non-emoji codepoints shifted right by 14 (2^21/240 ⇒ 14 bits), i.e. bit i is set when the label contains a non-emoji codepoint cp with cp >> 14 == i
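A rough sketch of how one labelData word could be packed and unpacked. The post does not say which end of the word holds the length, so putting it in the low 16 bits is an assumption, and the names `pack_label_data` / `unpack_label_data` are made up for illustration.

```python
LENGTH_BITS = 16    # assumed position: label length lives in the low 16 bits
BUCKET_SHIFT = 14   # each bitset bit covers a block of 2**14 codepoints


def pack_label_data(length: int, codepoints: list[int]) -> int:
    """Pack a label length and the bitset of its non-emoji codepoints into one word."""
    if not 0 <= length < (1 << LENGTH_BITS):
        raise ValueError("label length does not fit in 16 bits")
    bitset = 0
    for cp in codepoints:
        # codepoints 0x0000..0x3FFF set bit 0, 0x4000..0x7FFF set bit 1, and so on
        bitset |= 1 << (cp >> BUCKET_SHIFT)
    return (bitset << LENGTH_BITS) | length


def unpack_label_data(label_data: int) -> tuple[int, int]:
    """Return (length, bitset) from a packed word."""
    return label_data & ((1 << LENGTH_BITS) - 1), label_data >> LENGTH_BITS
```

The highest Unicode codepoint, 0x10FFFF, lands in bit 67, so the bitset sits comfortably inside the 240 bits available.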
From that, it’s easy to compute the namehash, extract any label, compute any label hash, or quickly determine which validations could apply. E.g. the basic validator (DNS + Emoji) only runs if the bitset is 0x1. normNoEmoji avoids processing emoji again during validation.
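To illustrate the "extract any label" point, here is a small sketch that walks labelData to slice labels out of the normalized bytes without scanning for separators. It assumes the length is in the low 16 bits of each word and that lengths are in bytes; `split_labels` and `could_be_basic` are hypothetical names.

```python
LENGTH_BITS = 16  # assumed: length in the low 16 bits, bitset in the bits above


def split_labels(norm: bytes, label_data: list[int]) -> list[tuple[bytes, int]]:
    """Return (label_bytes, bitset) pairs, left to right, using only the stored lengths."""
    out = []
    start = 0
    for packed in label_data:
        length = packed & ((1 << LENGTH_BITS) - 1)
        bitset = packed >> LENGTH_BITS
        out.append((norm[start:start + length], bitset))
        start += length + 1  # skip the "." separator
    return out


def could_be_basic(bitset: int) -> bool:
    """Only bucket 0 is active, i.e. every non-emoji codepoint is below 0x4000."""
    return bitset == 1
```

Any label whose bitset is not exactly 0x1 can skip the basic validator without its bytes being read.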
The primary normalize function should be string → (string, hash) like you describe.
I think all the charset stuff should exist in the validator; whether that’s the same contract or a different contract, I’m not sure. Most validator checks are per-label, so given (label, bitset), you can efficiently check if it’s valid. IIRC, only the bidi check requires a full-name check.
The low level validation function would be (string, bitset) → bool like:
    function validate(string memory label, uint256 bitset) public pure returns (bool) {
        // any 0 byte is a previously processed emoji
        if (bitset == 1) { // only codepoints [0, 0x4000)
            // check if basic
            // check if latin non-confusable
            // check if greek w/o wholescript confusable
            // etc...
        }
        return false;
    }
Or, more cleanly, the validation contract could have arbitrary functions keyed by bitset + nonce, so you could register/unregister any number of validation functions and quickly check which ones could apply by intersecting with the label bitset, etc.
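One way the keyed-validator idea might look, sketched off-chain. The `ValidatorRegistry` class and the `is_basic` check are hypothetical. "Could apply" is interpreted here as "the validator's bitset covers every bucket the label uses", which is an assumption about what intersecting should mean.

```python
from typing import Callable

Validator = Callable[[bytes], bool]


class ValidatorRegistry:
    """Validators keyed by (bitset, nonce), so several can share the same bitset."""

    def __init__(self) -> None:
        self._entries: dict[tuple[int, int], Validator] = {}

    def register(self, bitset: int, nonce: int, fn: Validator) -> None:
        self._entries[(bitset, nonce)] = fn

    def unregister(self, bitset: int, nonce: int) -> None:
        self._entries.pop((bitset, nonce), None)

    def candidates(self, label_bitset: int) -> list[Validator]:
        # keep a validator only if the label uses no bucket outside its mask
        return [fn for (mask, _), fn in self._entries.items()
                if label_bitset & ~mask == 0]

    def validate(self, label: bytes, label_bitset: int) -> bool:
        # a label is accepted if at least one candidate validator approves it
        return any(fn(label) for fn in self.candidates(label_bitset))


def is_basic(label: bytes) -> bool:
    # hypothetical check: a-z, 0-9, hyphen; 0 bytes are already-processed emoji
    allowed = b"abcdefghijklmnopqrstuvwxyz0123456789-"
    return all(b == 0 or b in allowed for b in label)
```

Registering `is_basic` under bitset 1 means it is only consulted for labels whose non-emoji codepoints all fall below 0x4000.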
The primary normalize + validation function would be string → (string norm, uint256 hash, bool valid) where:
    given name
    (norm, normNoEmoji, labelData) = normalize(name)
    end = norm.length
    hash = 0
    valid = true
    // possibly apply full-name check
    for [len, bitset] of labelData.reverse()  // rightmost label first, as namehash requires
      start = end - len
      hash = keccak(hash + keccak(norm.slice(start, end)))
      valid &= validate(normNoEmoji.slice(start, end), bitset)
      end = start - 1  // skip the "." separator
    return (norm, hash, valid)
- If this throws, the name is invalid.
- If valid is false, the user should get some kind of warning that the name is potentially unsafe (where unsafe means one or more labels satisfied 0 approved validators).
- If valid isn’t needed, use the primary normalize function instead.
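For anyone who wants to poke at the loop, here is a runnable sketch of the hash + validate pass. It takes the low-level normalize outputs as inputs rather than implementing normalization, and it assumes the length sits in the low 16 bits of each labelData word. Python's standard library has no Keccak-256, so `hashlib.sha3_256` is used purely as a stand-in: the resulting hashes will not match real ENS namehashes. `hash_and_validate` is a made-up name.

```python
import hashlib
from typing import Callable

LENGTH_BITS = 16  # assumed: length in the low 16 bits, bitset in the bits above


def stand_in_hash(data: bytes) -> bytes:
    # Stand-in only: ENS uses Keccak-256, which differs from SHA3-256.
    return hashlib.sha3_256(data).digest()


def hash_and_validate(
    norm: bytes,
    norm_no_emoji: bytes,
    label_data: list[int],
    validate: Callable[[bytes, int], bool],
    h: Callable[[bytes], bytes] = stand_in_hash,
) -> tuple[bytes, bytes, bool]:
    end = len(norm)
    node = bytes(32)  # the root node is 32 zero bytes
    valid = True
    # namehash folds labels from right to left, so walk labelData in reverse
    for packed in reversed(label_data):
        length = packed & ((1 << LENGTH_BITS) - 1)
        bitset = packed >> LENGTH_BITS
        start = end - length
        node = h(node + h(norm[start:end]))
        valid = validate(norm_no_emoji[start:end], bitset) and valid
        end = start - 1  # step over the "." separator
    return norm, node, valid
```

Since emoji are zeroed in place, norm and normNoEmoji have the same byte length, which is why the same offsets can slice both.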