I think that’s a fair concern but:
- The current length-checking code already iterates byte by byte, determining length from UTF-8 byte ranges
- CJK characters fall within specific Unicode ranges that are easy to check for
So you could implement these checks on-chain at the cost of a trivial increase (~2%) in gas costs (details below).
If you look at the current price oracle contract for ENS domains, the price is a function of length, which is determined in the following way:
function strlen(string memory s) internal pure returns (uint) {
    uint len;
    uint i = 0;
    uint bytelength = bytes(s).length;
    for (len = 0; i < bytelength; len++) {
        bytes1 b = bytes(s)[i];
        // The lead byte of a UTF-8 sequence encodes how many bytes it
        // spans, so skip ahead accordingly and count one character per
        // sequence.
        if (b < 0x80) {
            i += 1;
        } else if (b < 0xE0) {
            i += 2;
        } else if (b < 0xF0) {
            i += 3;
        } else if (b < 0xF8) {
            i += 4;
        } else if (b < 0xFC) {
            i += 5;
        } else {
            i += 6;
        }
    }
    return len;
}
This uses preset byte ranges to determine the length of a UTF-8 encoded string, although it isn’t exactly perfect, especially when it comes to emoji: anything built from several codepoints (ZWJ sequences, variation selectors, skin tone modifiers) is counted once per codepoint. For instance, the function will consider a string of 3 such emoji like “ ” to be of length 5, or “:woman::woman:” to be of length 6, meaning that the current short-domain premiums do not apply to these domains. I don’t think it really is a problem, as a function properly accounting for these would be more expensive in gas.
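If you want to verify these counts yourself, here is a quick sanity check. This is a Foundry test of my own, not something from the ENS repo; it assumes the strlen above has been copied in as a free function:

pragma solidity ^0.8.13;

import "forge-std/Test.sol";

// strlen copied verbatim from the oracle's string library (see above)
function strlen(string memory s) pure returns (uint) {
    uint len;
    uint i = 0;
    uint bytelength = bytes(s).length;
    for (len = 0; i < bytelength; len++) {
        bytes1 b = bytes(s)[i];
        if (b < 0x80) { i += 1; }
        else if (b < 0xE0) { i += 2; }
        else if (b < 0xF0) { i += 3; }
        else if (b < 0xF8) { i += 4; }
        else if (b < 0xFC) { i += 5; }
        else { i += 6; }
    }
    return len;
}

contract StrlenTest is Test {
    function testCodepointCounts() public {
        assertEq(strlen("abc"), 3);           // ASCII: 1 byte per character
        assertEq(strlen(unicode"김일성"), 3);   // hangul: one 3-byte sequence each
        // Couple emoji U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469 is a
        // single glyph but six codepoints, so it is "length" 6
        assertEq(strlen(unicode"👩‍❤️‍👩"), 6);
    }
}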
In any case, while this function is implemented as part of a string library that is used elsewhere in the codebase, there is no reason to use this exact function when determining the premiums applied to short domains. Instead of a string length, we can use the concept of a string “score” based on both length and the codepoint ranges its characters fall in. We can thus modify the function slightly to give more weight to CJK characters:
function stringScore(string memory s) external pure returns (uint) {
    uint len;
    uint i = 0;
    uint bytelength = bytes(s).length;
    for (len = 0; i < bytelength; len++) {
        bytes1 b = bytes(s)[i];
        if (b < 0x80) {
            i += 1;
        } else if (b < 0xE0) {
            i += 2;
        } else if (b < 0xF0) {
            // account for the range containing hiragana, katakana,
            // Chinese characters, hangul and a few extras
            if (b >= 0xE3 && b <= 0xED) {
                len += 2;
            }
            i += 3;
        } else if (b < 0xF8) {
            i += 4;
        } else if (b < 0xFC) {
            i += 5;
        } else {
            i += 6;
        }
    }
    return len;
}
The range used here captures the most commonly used CJK characters: in UTF-8, a 3-byte sequence with lead byte 0xE3 encodes U+3000–U+3FFF (kana and CJK punctuation), and 0xED tops out at U+DFFF, just past the Hangul syllables block, with the CJK Unified Ideographs in between. To avoid increasing gas costs too much I haven’t used more granular ranges, so this will also include a few other scripts like Cherokee and Cham, but I think we can safely assume that speculation on these will be limited.
With this scoring function, the Korean name 김일성 would be scored as 9 and the Japanese name 山田太郎 as 12, effectively negating the premium currently imposed on East Asian languages. Alphanumeric strings like “123” or “aef” remain scored at 3. (Note that the function proposed here is a separate, alternative function to the existing length-computing function, so while a 2-character name like 평양 would score 6, it wouldn’t be registrable because it still wouldn’t pass the length check. It would be nice if it did, but that’s a battle for another day.)
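To make the two-function split concrete, the wiring I have in mind looks roughly like this. It is a sketch with hypothetical names (price3Char etc. are mine; as I understand it the deployed oracle looks the price up by length), only meant to illustrate where stringScore would slot in:

// Sketch only: assumes strlen and stringScore from above are in scope,
// and price3Char / price4Char / basePrice are configured elsewhere.
function rentPrice(string memory name) internal view returns (uint) {
    // The registrability gate is unchanged: 평양 (strlen 2) still fails
    // here even though it scores 6.
    require(strlen(name) >= 3, "name too short");

    uint score = stringScore(name);
    if (score == 3) return price3Char; // e.g. "abc", "123"
    if (score == 4) return price4Char;
    return basePrice;                  // score >= 5, e.g. 김일성 (9): no premium
}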
The gas impact is also negligible. The additional check adds roughly 600 gas, and the function itself costs roughly 24k gas to call, so this is only a ~2% increase for a function that is already very cheap to call.
And finally, regarding @clowes.eth's comment, I don’t think this would create “reverse discrimination”. The users registering 3- and 4-character alphanumeric domains are overwhelmingly doing so to speculate on decimal/hexadecimal domains or token names. Most Latin-alphabet names and words are longer than 4 characters, so this will not affect non-speculative use cases.