I think that’s a fair concern but:
- The current length-checking code already iterates byte by byte, determining length from UTF-8 byte ranges
- CJK characters fall within specific Unicode ranges that are easy to check for
So you could implement these checks on-chain at the cost of a trivial increase (~2%) in gas costs (details below).
If you look at the current price oracle contract for ENS domains, the price is a function of length, which is determined in the following way:
function strlen(string memory s) internal pure returns (uint) {
    uint len;
    uint i = 0;
    uint bytelength = bytes(s).length;
    for (len = 0; i < bytelength; len++) {
        bytes1 b = bytes(s)[i];
        // The lead byte of a UTF-8 sequence encodes how many bytes it
        // spans, so skip ahead accordingly and count one character per
        // sequence.
        if (b < 0x80) {
            i += 1;
        } else if (b < 0xE0) {
            i += 2;
        } else if (b < 0xF0) {
            i += 3;
        } else if (b < 0xF8) {
            i += 4;
        } else if (b < 0xFC) {
            i += 5;
        } else {
            i += 6;
        }
    }
    return len;
}
This uses preset byte ranges to determine the length of a UTF-8 encoded string, although it isn’t exactly perfect, especially when it comes to emoji: anything built from several codepoints (ZWJ sequences, variation selectors, skin tone modifiers) is counted once per codepoint. For instance, the function will consider a string of 3 such emoji like “ ” to be of length 5, or “:woman::woman:” to be of length 6, meaning that the current short-domain premiums do not apply to these domains. I don’t think it really is a problem, as a function properly accounting for these would be more expensive in gas.
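If you want to verify these counts yourself, here is a quick sanity check. This is a Foundry test of my own, not something from the ENS repo; it assumes the strlen above has been copied in as a free function:

pragma solidity ^0.8.13;

import "forge-std/Test.sol";

// strlen copied verbatim from the oracle's string library (see above)
function strlen(string memory s) pure returns (uint) {
    uint len;
    uint i = 0;
    uint bytelength = bytes(s).length;
    for (len = 0; i < bytelength; len++) {
        bytes1 b = bytes(s)[i];
        if (b < 0x80) { i += 1; }
        else if (b < 0xE0) { i += 2; }
        else if (b < 0xF0) { i += 3; }
        else if (b < 0xF8) { i += 4; }
        else if (b < 0xFC) { i += 5; }
        else { i += 6; }
    }
    return len;
}

contract StrlenTest is Test {
    function testCodepointCounts() public {
        assertEq(strlen("abc"), 3);           // ASCII: 1 byte per character
        assertEq(strlen(unicode"김일성"), 3);   // hangul: one 3-byte sequence each
        // Couple emoji U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469 is a
        // single glyph but six codepoints, so it is "length" 6
        assertEq(strlen(unicode"👩‍❤️‍👩"), 6);
    }
}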
In any case, while this function is implemented as part of a string library that is used elsewhere in the codebase, there is no reason to use this exact function when determining the premiums applied to short domains. Instead of a string length, we can use the concept of a string “score” based on both length and the codepoint ranges its characters fall in. We can thus modify the function slightly to give more weight to CJK characters:
function stringScore(string memory s) external pure returns (uint) {
    uint len;
    uint i = 0;
    uint bytelength = bytes(s).length;
    for (len = 0; i < bytelength; len++) {
        bytes1 b = bytes(s)[i];
        if (b < 0x80) {
            i += 1;
        } else if (b < 0xE0) {
            i += 2;
        } else if (b < 0xF0) {
            // account for the range containing hiragana, katakana,
            // Chinese characters, hangul and a few extras
            if (b >= 0xE3 && b <= 0xED) {
                len += 2;
            }
            i += 3;
        } else if (b < 0xF8) {
            i += 4;
        } else if (b < 0xFC) {
            i += 5;
        } else {
            i += 6;
        }
    }
    return len;
}
The range used here captures the most commonly used CJK characters: in UTF-8, a 3-byte sequence with lead byte 0xE3 encodes U+3000–U+3FFF (kana and CJK punctuation), and 0xED tops out at U+DFFF, just past the Hangul syllables block, with the CJK Unified Ideographs in between. To avoid increasing gas costs too much I haven’t used more granular ranges, so this will also include a few other scripts like Cherokee and Cham, but I think we can safely assume that speculation on these will be limited.
With this scoring function, the Korean name 김일성 would be scored as 9 and the Japanese name 山田太郎 as 12, effectively negating the premium currently imposed on East Asian languages. Alphanumeric strings like “123” or “aef” remain scored at 3. (Note that the function proposed here is a separate, alternative function to the existing length-computing function, so while a 2-character name like 평양 would score 6, it wouldn’t be registrable because it still wouldn’t pass the length check. It would be nice if it did, but that’s a battle for another day.)
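To make the two-function split concrete, the wiring I have in mind looks roughly like this. It is a sketch with hypothetical names (price3Char etc. are mine; as I understand it the deployed oracle looks the price up by length), only meant to illustrate where stringScore would slot in:

// Sketch only: assumes strlen and stringScore from above are in scope,
// and price3Char / price4Char / basePrice are configured elsewhere.
function rentPrice(string memory name) internal view returns (uint) {
    // The registrability gate is unchanged: 평양 (strlen 2) still fails
    // here even though it scores 6.
    require(strlen(name) >= 3, "name too short");

    uint score = stringScore(name);
    if (score == 3) return price3Char; // e.g. "abc", "123"
    if (score == 4) return price4Char;
    return basePrice;                  // score >= 5, e.g. 김일성 (9): no premium
}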
The gas impact is also negligible. The additional check adds roughly 600 gas, and the function itself costs roughly 24k gas to call, so this is only a ~2% increase for a function that is already very cheap to call.
And finally, regarding @clowes.eth's comment, I don’t think this would create “reverse discrimination”. The users registering 3- and 4-character alphanumeric domains are overwhelmingly doing so to speculate on decimal/hexadecimal domains or token names. Most Latin-alphabet names and words are longer than 4 characters, so this will not affect non-speculative use cases.