Bad CIDv1 Tor/Onion3 contenthash format

TL;DR:
The current onion3 contenthash implementation does NOT follow CIDv1 properly.

From the test files of ENS/content-hash:

Onion3 multiaddr/multiformat codecs

Here,
varint(0x01bd) = prefix bd03, which is the multiaddr code for onion3.
For IPFS/NS and Swarm, that prefix position holds a proper namespace, but there’s no onion3 namespace, so the current implementation uses this format:

const onion3 = "p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd";
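// "bd03" + hex of the ASCII characters of the address (not the base32-decoded bytes)
const onion3_contentHash = "bd03" + Buffer.from(onion3, "ascii").toString("hex");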
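
A minimal sketch of where that value comes from, assuming Node.js. The helper names `encodeVarint` and `toCurrentOnion3Contenthash` are made up for illustration and are not part of ENS/content-hash. The multicodec code 0x01bd is written as an unsigned varint (7 bits per byte, low bits first), then the 56 ASCII characters of the address are appended unchanged.

```javascript
// Unsigned varint (LEB128 style) as used by multicodec: 7 bits per byte, low bits first.
function encodeVarint(n) {
  const out = [];
  while (n >= 0x80) {
    out.push((n & 0x7f) | 0x80); // high bit set: more bytes follow
    n >>>= 7;
  }
  out.push(n);
  return Buffer.from(out);
}

const ONION3_CODE = 0x01bd; // multicodec code for onion3 (445 decimal)

// Illustrative helper: varint prefix + ASCII bytes of the address, as in the test vector.
function toCurrentOnion3Contenthash(address) {
  return Buffer.concat([
    encodeVarint(ONION3_CODE),
    Buffer.from(address, "ascii"),
  ]).toString("hex");
}

const onion3 = "p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd";
console.log(encodeVarint(ONION3_CODE).toString("hex")); // bd03
console.log(toCurrentOnion3Contenthash(onion3));
```

The result is 2 prefix bytes plus 56 payload bytes, because every base32 character is stored as one ASCII byte.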

Following Tor/Onion3 Address Spec :

https://spec.torproject.org/rend-spec/encoding-onion-addresses.html

 onion_address = base32(PUBKEY | CHECKSUM | VERSION) + ".onion"
 CHECKSUM = H(".onion checksum" | PUBKEY | VERSION)[:2]

 where:
   - PUBKEY is the 32 bytes ed25519 master pubkey of the hidden service.
   - VERSION is a one byte version field (default value '\x03')
   - ".onion checksum" is a constant string
   - CHECKSUM is truncated to two bytes before inserting it in onion_address
 Here are a few example addresses:
   pg6mmjiyjmcrsslvykfwnntlaru7p5svn6y2ymmju6nubxndf4pscryd.onion
   sp3k262uwy4r2k3ycr5awluarykdpag6a7y33jxop4cs2lu5uz5sseqd.onion
   xa4r2iadxm55fbnqgwwi5mymqdcofiu3w6rpbtqn7b2dyn7mgwj64jyd.onion
 
** H = SHA3-256
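
The layout quoted above can be sketched like this, assuming Node.js with SHA3-256 available in `crypto`. `base32Decode`, `parseOnion3` and `onionChecksum` are illustrative names, not an existing library API. The sketch base32-decodes the 56 characters into 35 bytes, splits them into PUBKEY | CHECKSUM | VERSION, and recomputes the checksum for comparison.

```javascript
const { createHash } = require("node:crypto");

const B32 = "abcdefghijklmnopqrstuvwxyz234567";

// RFC 4648 base32 decode (lowercase, no padding).
function base32Decode(str) {
  let bits = 0;
  let value = 0;
  const out = [];
  for (const ch of str) {
    const idx = B32.indexOf(ch);
    if (idx < 0) throw new Error(`invalid base32 character: ${ch}`);
    value = (value << 5) | idx;
    bits += 5;
    if (bits >= 8) {
      bits -= 8;
      out.push((value >>> bits) & 0xff);
      value &= (1 << bits) - 1; // keep only the bits not yet emitted
    }
  }
  return Buffer.from(out);
}

// Split an onion v3 address into PUBKEY (32) | CHECKSUM (2) | VERSION (1).
function parseOnion3(address) {
  const raw = base32Decode(address.replace(/\.onion$/, ""));
  if (raw.length !== 35) throw new Error(`expected 35 bytes, got ${raw.length}`);
  return {
    pubkey: raw.subarray(0, 32),
    checksum: raw.subarray(32, 34),
    version: raw[34],
  };
}

// CHECKSUM = SHA3-256(".onion checksum" | PUBKEY | VERSION)[:2]
function onionChecksum(pubkey, version) {
  return createHash("sha3-256")
    .update(Buffer.concat([Buffer.from(".onion checksum"), pubkey, Buffer.from([version])]))
    .digest()
    .subarray(0, 2);
}

const parsed = parseOnion3("pg6mmjiyjmcrsslvykfwnntlaru7p5svn6y2ymmju6nubxndf4pscryd.onion");
const checksumOk = parsed.checksum.equals(onionChecksum(parsed.pubkey, parsed.version));
console.log(parsed.version, checksumOk);
```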

Alternative format, *IF* we keep the same “bd03” multiaddr prefix as the NS and base32-decode the address first:

0xbd03+ decodebase32(onion3_addr)
0xbd03 + 7f76b2f7f075714ada42f5db17bf0f5e3759c19f5f9e27054cc424871704cb4b4ed203

Correct formats:
CIDv1 onion multiaddr format, without using the bd03 multiaddr as NS (* there’s no NS for onion3).
a) onion3 multiaddr + identity hash, with checksum & onion version = 3
<v=1><multiaddr=onion3><id=0><length=32+2+1>
= 01bd0300237f76b2f7f075714ada42f5db17bf0f5e3759c19f5f9e27054cc424871704cb4b4ed203

base32 : bag6qgabdp53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd
base36 : k62lgnso3wxo2p8d18ab5igmlsdvz4l54gbhyxj78rdvex9s20lzrp8zcy270j

b) onion3 multiaddr with a proper ed25519 key type, and NO checksum or onion3 version suffix.
<v=1><multiaddr=onion3><Iden/type=ed01><len=32>
= 01bd03ed01207f76b2f7f075714ada42f5db17bf0f5e3759c19f5f9e27054cc424871704cb4b
base32 : bag6qh3ibeb7xnmxx6b2xcsw2il25wf57b5pdowobt5pz4jyfjtccjbyxatfuw
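
A rough sketch of how formats a) and b) can be assembled and rendered as multibase “b” (base32) strings, assuming Node.js. `base32Encode` is a hand-rolled illustrative helper. The length byte in a) is computed from the payload: 32+2+1 = 35 = 0x23, which is also what the base32 string for a) decodes to. Single-byte lengths are enough here because both payloads are under 128 bytes.

```javascript
const B32 = "abcdefghijklmnopqrstuvwxyz234567";

// RFC 4648 base32 encode (lowercase, no padding), as used by multibase "b".
function base32Encode(bytes) {
  let bits = 0;
  let value = 0;
  let out = "";
  for (const byte of bytes) {
    value = (value << 8) | byte;
    bits += 8;
    while (bits >= 5) {
      bits -= 5;
      out += B32[(value >>> bits) & 31];
    }
    value &= (1 << bits) - 1; // keep only the bits not yet emitted
  }
  if (bits > 0) out += B32[(value << (5 - bits)) & 31]; // zero-pad the last group
  return out;
}

// 35 raw bytes of the example address: PUBKEY (32) | CHECKSUM (2) | VERSION (1).
const raw = Buffer.from(
  "7f76b2f7f075714ada42f5db17bf0f5e3759c19f5f9e27054cc424871704cb4b4ed203",
  "hex"
);
const head = Buffer.from("01bd03", "hex"); // <v=1><varint(0x01bd)>

// a) identity code (0x00) + length + all 35 bytes
const formatA = Buffer.concat([head, Buffer.from([0x00, raw.length]), raw]);
// b) ed25519 code (varint(0xed) = ed01) + length + the 32-byte public key only
const formatB = Buffer.concat([head, Buffer.from([0xed, 0x01, 32]), raw.subarray(0, 32)]);

console.log(formatA.toString("hex"), "b" + base32Encode(formatA));
console.log(formatB.toString("hex"), "b" + base32Encode(formatB));
```

In a) the 5-byte header is exactly 8 base32 characters, so the original onion address stays readable at the end of the string.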

This proves the current onion3 contenthash is a make-believe CIDv1. A proper fix will require a new namespace for onion3, or it could use the ENS side of the NS specs. Speaking of which, @raffy could rebrand the current PR for the datauri NS into a full ENS namespace for everything ENS… And for the record, we’re all doing this CID thing in reverse: ENS should specify its own NS/specs and ask multiformats to include those specs in the table, not the other way around.


I’m not sure I follow, onion3 contenthash is not encoded as a CID.

It is just <protoCode><data> like ENSIP-7 says.

From resolverworks/enson.js

const chash = Chash.fromOnion('p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd');
console.log(chash.toObject());
console.log(chash.toPhex());
{
  protocol: { codec: 445, name: 'Onion' },
  url: 'http://p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd.onion',
  pubkey: Uint8Array(32) [
    127, 118, 178, 247, 240, 117, 113, 74,
    218,  66, 245, 219,  23, 191,  15, 94,
     55,  89, 193, 159,  95, 158,  39,  5,
     76, 196,  36, 135,  23,   4, 203, 75
  ],
  checksum: Uint8Array(2) [ 78, 210 ],
  version: 3
}
0xbd037f76b2f7f075714ada42f5db17bf0f5e3759c19f5f9e27054cc424871704cb4b4ed203

Oh, I see, that hex doesn’t match. It looks like the ENS example is using <protoCode><ASCII>

const onion3_contentHash = "bd037035336c663537716f7679757677736336786e72707079706c79337674716d376c3670636f626b6d797173696f6679657a6e667535757164";

0x7035336c663537716f7679757677736336786e72707079706c79337674716d376c3670636f626b6d797173696f6679657a6e667535757164 → "p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd"


Even if we accept that as a correct CIDv1 <protocode>, the onion3 address should be base32-decoded before it is prefixed with that multiaddr code.

0xbd037f76b2f7f075714ada42f5db17bf0f5e3759c19f5f9e27054cc424871704cb4b4ed203

Vs.

0xbd037035336c663537716f7679757677736336786e72707079706c79337674716d376c3670636f626b6d797173696f6679657a6e667535757164
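
The two layouts can be told apart by payload length alone: 35 bytes for the base32-decoded form, 56 bytes for the ASCII form. A minimal sketch, assuming Node.js, that maps either one back to the onion address (`onionFromContenthash` and `base32Encode` are hypothetical helpers for illustration):

```javascript
const B32 = "abcdefghijklmnopqrstuvwxyz234567";

// RFC 4648 base32 encode (lowercase, no padding).
function base32Encode(bytes) {
  let bits = 0;
  let value = 0;
  let out = "";
  for (const byte of bytes) {
    value = (value << 8) | byte;
    bits += 8;
    while (bits >= 5) {
      bits -= 5;
      out += B32[(value >>> bits) & 31];
    }
    value &= (1 << bits) - 1; // keep only the bits not yet emitted
  }
  if (bits > 0) out += B32[(value << (5 - bits)) & 31];
  return out;
}

// Illustrative helper: accept either payload layout after the bd03 prefix.
function onionFromContenthash(hex) {
  const buf = Buffer.from(hex.replace(/^0x/, ""), "hex");
  if (buf[0] !== 0xbd || buf[1] !== 0x03) throw new Error("not an onion3 contenthash");
  const payload = buf.subarray(2);
  if (payload.length === 56) return payload.toString("ascii"); // ASCII form (current test vector)
  if (payload.length === 35) return base32Encode(payload); // base32-decoded form
  throw new Error(`unexpected payload length ${payload.length}`);
}

const onion3 = "p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd";
const decodedForm =
  "0xbd037f76b2f7f075714ada42f5db17bf0f5e3759c19f5f9e27054cc424871704cb4b4ed203";
const asciiForm = "0xbd03" + Buffer.from(onion3, "ascii").toString("hex");

console.log(onionFromContenthash(decodedForm) === onion3);
console.log(onionFromContenthash(asciiForm) === onion3);
```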

We can check what the <protocode> format really is with Swarm, as it’s not “ipfs/ns”, OR with IPNS, as it’s NOT an IPLD type and it uses libp2p-key, which defaults to ed25519 (*it also supports secp256k1 keys).

swarm hash : d1de9994b4d039f6548d191eb26786769f580809256b4685ef316805265ea162
swarm.namespace = 0xe401 = varint(0xe4)
cid.v1 = 0x01
swarm.manifest = 0xfa01 = varint(0xfa)
multihash.keccak256 = 0x1b
hash.length = 0x20 = 32

Swarm.contenthash : e40101fa011b20d1de9994b4d039f6548d191eb26786769f580809256b4685ef316805265ea162
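
The Swarm breakdown above can be reproduced field by field. A minimal sketch, assuming Node.js (`encodeVarint` is an illustrative helper, the codes are the ones listed above):

```javascript
// Unsigned varint: 7 bits per byte, low bits first, high bit marks continuation.
function encodeVarint(n) {
  const out = [];
  while (n >= 0x80) {
    out.push((n & 0x7f) | 0x80);
    n >>>= 7;
  }
  out.push(n);
  return Buffer.from(out);
}

const swarmHash = "d1de9994b4d039f6548d191eb26786769f580809256b4685ef316805265ea162";
const digest = Buffer.from(swarmHash, "hex");

const swarmContenthash = Buffer.concat([
  encodeVarint(0xe4), // swarm namespace -> e401
  Buffer.from([0x01]), // CID version 1
  encodeVarint(0xfa), // swarm manifest codec -> fa01
  Buffer.from([0x1b]), // keccak-256 multihash code
  Buffer.from([digest.length]), // digest length, 0x20 = 32
  digest,
]).toString("hex");

console.log(swarmContenthash);
```

Everything after the e401 namespace is a complete CIDv1 here, which is the property the onion3 encoding lacks.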

IF we really don’t care about following CIDv1 properly, we can reuse bd03 as both namespace & multiaddr. :rofl: Either way, the current implementation can’t represent every “CIDv1 contenthash” as a proper CIDv1 string, i.e. the “b”, “k” & “f” prefixed base32/36/16 strings used by dweb2/3 gateways.


onion is not encoded as a CID though! Maybe this was the same sticking point about Data URL.

ENSIP-7’s <protoCode><bytes> just says bytes follow the multicodec stuff and… onion is encoded just as bytes (multiaddr), not as a CID. Oh, interesting: I have no idea what the proper multiaddr encoding of an onion is, nor does the official repo :rofl:. I agree the current encoding is not self-describing. But this is how I think onion and other new formats should be encoded.


Speaking more generally: I think all of the multicodec stuff is noise and it’s a solution to a problem that ENS doesn’t have. ENS can encode arbitrary bytes without any ambiguity.

There’s nothing wrong with IPFS and others using multicodec and CID but that decision shouldn’t be enforced universally.

All contenthashes can be mapped to an URI:

  • URI: <scheme>://<ascii-like>.
  • Contenthash: <protoCode><bytes>

Notice how this is exactly the same as ENSIP-7 except URI is constrained by legacy protocol transmission rules (ascii, octets, dns, small buffers, etc.) and contenthash simply assumes the parent protocol can transmit arbitrary bytes.

This is also conceptually the same as addr(coinType) where coinType (like protoCode) is sufficient for decoding the value. With addr() you ask the name for type X, whereas contenthash() embeds the type X because there is only one contenthash.

  • addr(x) = y
  • contenthash() = x + y

For example, I’ve made the point before that contenthash could have just been an exotic coinType: addr(CH) = x + y
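
To make the contenthash() = x + y idea concrete, here is a minimal sketch, assuming Node.js, that splits a contenthash into its leading varint protoCode (x) and the remaining bytes (y). `decodeVarint` and `splitContenthash` are illustrative names only. Nothing about y is interpreted, which is the point being made: the protoCode alone tells the caller which decoder to use.

```javascript
// Read an unsigned varint from the start of buf; returns [value, bytesRead].
function decodeVarint(buf) {
  let value = 0;
  let shift = 0;
  for (let i = 0; i < buf.length; i++) {
    value += (buf[i] & 0x7f) * 2 ** shift;
    if ((buf[i] & 0x80) === 0) return [value, i + 1];
    shift += 7;
  }
  throw new Error("truncated varint");
}

// contenthash = <protoCode><bytes>
function splitContenthash(hex) {
  const buf = Buffer.from(hex.replace(/^0x/, ""), "hex");
  const [protoCode, consumed] = decodeVarint(buf);
  return { protoCode, bytes: buf.subarray(consumed) };
}

const onionParts = splitContenthash(
  "0xbd037f76b2f7f075714ada42f5db17bf0f5e3759c19f5f9e27054cc424871704cb4b4ed203"
);
const swarmParts = splitContenthash(
  "e40101fa011b20d1de9994b4d039f6548d191eb26786769f580809256b4685ef316805265ea162"
);

console.log(onionParts.protoCode.toString(16), onionParts.bytes.length); // 1bd 35
console.log(swarmParts.protoCode.toString(16), swarmParts.bytes.length); // e4 37
```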


I was checking Dune/ENS stats: there are 48 onion3 contenthashes in total across the old and new public resolvers (46 + 2). I guess this onion3 contenthash format is locked in forever…

We’ll end up creating more prefix collisions with CIDv1 in the same context, or a new multiformat table for ENS only…

As we’re breaking self-describing CIDv1 with onion3, once the NS prefix is removed a gateway has to take extra steps to check whether it’s a base32 CIDv1 prefixed with “b” OR a random onion base32 string starting with “b”… I’m now covering this onion3 hole with an extra length check (= 56), but we also support raw IPLD without a fixed length, so we have to double-check for that specific length.
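
A hedged sketch of that kind of gateway-side check, assuming Node.js or any JS runtime. `classifyGatewayLabel` is a made-up name and this is not the actual gateway code. Besides the length check of 56, it uses one extra fact that follows from the address spec: the last 5 bits of a v3 address come from VERSION = 0x03, so the final base32 character is always “d”. A 56-character base32 CIDv1 that starts with “b” and ends with “d” is still ambiguous, so this narrows the collision but does not remove it.

```javascript
// Illustrative classifier for a label with the namespace prefix already stripped.
function classifyGatewayLabel(label) {
  // onion v3: exactly 56 base32 characters, final character "d" (VERSION = 0x03)
  if (/^[a-z2-7]{55}d$/.test(label)) return "onion3-candidate";
  // multibase prefixes for CIDv1 strings
  if (/^b[a-z2-7]+$/.test(label)) return "cidv1-base32";
  if (/^k[0-9a-z]+$/.test(label)) return "cidv1-base36";
  if (/^f[0-9a-f]+$/.test(label)) return "cidv1-base16";
  return "unknown";
}

console.log(classifyGatewayLabel("p53lf57qovyuvwsc6xnrppyply3vtqm7l6pcobkmyqsiofyeznfu5uqd"));
console.log(classifyGatewayLabel("bag6qh3ibeb7xnmxx6b2xcsw2il25wf57b5pdowobt5pz4jyfjtccjbyxatfuw"));
console.log(classifyGatewayLabel("k62lgnso3wxo2p8d18ab5igmlsdvz4l54gbhyxj78rdvex9s20lzrp8zcy270j"));
```

The onion check runs first on purpose: an onion address that happens to start with “b” would otherwise be taken for a base32 CID.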

At minimum we should get an ENS namespace for everything defined by future ENSIP data specs… Only then can we request that those new ENS codec/data prefix specs be included in multiformats, to prevent future collisions between the ENS context and other protocols.

eg, <ENS_NS>+<v1>+<ENS.datauri.codec>+ <data.length> <hex(data:...)>

if it was up to me, I’d go full R&D in ENS+IPLD maxi… Add new ENS specific NS, namehash in multihash & more codecs to cover fully verifiable+chainless (chain agnostic) ENS records & ownership in parallel to L2/v2, *with minimum ~1 call to ETH/L2 RPC for ownership data.

^ That’s the full IPLD+ETH specs to store & retrieve full chain/tx data since the genesis block using IPFS nodes… We don’t have any ENS+IPLD specs to cover ENS sub-ownership & records+metadata storage. Our current solution is to throw a GraphQL endpoint at this; we can’t decentralize that, but it works. :stuck_out_tongue: