ERC-3668 edge case for clientside http/ssl errors

0xc0de4c0ffee · September 23, 2024, 12:16pm

Under ERC-3668 “Client Lookup Protocol”

…
…
5. If the sender field does not match the address of the contract that was called, return an error to the caller and stop.
6. Construct a request URL by replacing sender with the lowercase 0x-prefixed hexadecimal formatted sender parameter, and replacing data with the 0x-prefixed hexadecimal formatted callData parameter. The client may choose which URLs to try in which order, but SHOULD prioritise URLs earlier in the list over those later in the list.
7. Make an HTTP GET request to the request URL.
8. If the response code from step (5) is in the range 400-499, return an error to the caller and stop.
9. If the response code from step (5) is in the range 500-599, go back to step (5) and pick a different URL, or stop if there are no further URLs to try.
10. Otherwise, replace data with an ABI-encoded call to the contract function specified by the 4-byte selector callbackFunction, supplying the data returned from step (7) and extraData from step (4), and return to step (1).

I think the “response code from step (5)” in steps 8 and 9 is a typo left over from the draft; that should be (7)?

During a CCIP read, if the request in (7) fails with a client-side/network error (the server rejects the connection, or an SSL error occurs), there is no HTTP status code at all, and that prevents automatic fallback to the secondary gateway URL. For a failsafe scenario, CCIP clients should try all fallback gateways until one returns status code 200 before giving up.
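
A minimal TypeScript sketch of the fallback described above. It assumes a caller-supplied `fetcher` function (a hypothetical injection point; in a browser it could simply wrap `fetch`), and the names `fetchWithFallback`, `buildRequestUrl` and `GatewayResponse` are illustrative, not taken from ethers.js or viem. A request that throws, and so carries no status code, is treated like a 5xx and the loop moves on to the next gateway; 4xx still aborts, as the spec currently requires.

```typescript
// Minimal response shape so the sketch does not depend on DOM or Node typings.
interface GatewayResponse {
  status: number;
  json(): Promise<unknown>;
}

// Hypothetical injection point; in a browser this could be (url) => fetch(url).
type Fetcher = (url: string) => Promise<GatewayResponse>;

// Step 6 of the spec: substitute {sender} (lowercased) and {data} into the template.
function buildRequestUrl(template: string, sender: string, data: string): string {
  return template.replace("{sender}", sender.toLowerCase()).replace("{data}", data);
}

// Tries each gateway in order. A thrown request (DNS, TLS, connection reset)
// is recorded and the next URL is tried, exactly like a 5xx response.
async function fetchWithFallback(
  urls: string[],
  sender: string,
  data: string,
  fetcher: Fetcher,
): Promise<unknown> {
  const errors: string[] = [];
  for (const template of urls) {
    const url = buildRequestUrl(template, sender, data);
    let response: GatewayResponse;
    try {
      response = await fetcher(url);
    } catch (err) {
      errors.push(`${url}: network error: ${String(err)}`);
      continue; // no status code at all, so fall back
    }
    if (response.status >= 400 && response.status <= 499) {
      // Spec as written: a 4xx aborts the whole lookup.
      throw new Error(`${url}: HTTP ${response.status}`);
    }
    if (response.status >= 500 && response.status <= 599) {
      errors.push(`${url}: HTTP ${response.status}`);
      continue;
    }
    return response.json();
  }
  throw new Error(`all gateways failed: ${errors.join("; ")}`);
}
```

Passing the fetcher in keeps the loop testable without a network; a real client would also want a timeout per gateway, which is left out here.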

cc @ethlimo.eth here we go

nick.eth · September 23, 2024, 1:07pm

Good catch, that’s correct.

Yes, correct.

0xc0de4c0ffee · September 23, 2024, 3:38pm

@ethlimo.eth opened this issue for ethers.js.
It will take one more try/catch in CCIP clients.

github.com/ethers-io/ethers.js

CCIP-read fallback on network errors

opened 02:41PM - 23 Sep 24 UTC

eth-limo

enhancement

### Describe the Feature

Hello, Per this thread: https://discuss.ens.domain…s/t/erc-3668-edge-case-for-clientside-http-ssl-errors/19617, it looks like network errors or non-http error responses can prevent CCIP clients from falling back to a secondary gateway URL. I believe this is also impacting the CCIP-read client in ethers. This would be a great addition to ethers.js as user implemented offchain resolvers continues to grow. Thanks!

### Code Example

_No response_

0xc0de4c0ffee · September 28, 2024, 10:29am

Reporting back with more tests in ethers.js and viem.

Viem test: it handles CCIP fallback correctly.

try {
  const result = await viem.ccipRequest({
    data: '0xc0de4c0ffee',
    sender: '0xc0de4c0ffeeec0de4c0ffeeec0de4c0ffeeec0de',
    urls: [
      `https://down.gateway/{sender}/{data}`,
      `https://fallback.gateway/{sender}/{data}`
    ],
  })
} catch (e) {
  console.log(e)
}

>> HttpRequestError: HTTP request failed.
URL: https://fallback.gateway/{sender}/{data}
Details: Failed to fetch
Version: viem@2.17.4

Viem looks ok as it’s using universal resolver

github.com

wevm/viem/blob/320e2dc468248168691e6d3f3e2705d33e52c423/src/utils/ccip.ts#L133-L190


      
let error = new Error('An unknown error occurred.')

for (let i = 0; i < urls.length; i++) {
  const url = urls[i]
  const method = url.includes('{data}') ? 'GET' : 'POST'
  const body = method === 'POST' ? { data, sender } : undefined

  try {
    const response = await fetch(
      url.replace('{sender}', sender).replace('{data}', data),
      {
        body: JSON.stringify(body),
        method,
      },
    )

    let result: any
    if (
      response.headers.get('Content-Type')?.startsWith('application/json')
    ) {

This file has been truncated.

In ethers.js, the fetch request is sent outside the try/catch, so a network-level failure throws before the fallback logic runs. No Universal Resolver.

github.com

ethers-io/ethers.js/blob/5aba4963e3e8ddfc912747076f5b7fe7a743cfe2/src.ts/providers/abstract-provider.ts#L605-L630


      
    let errorMessage = "unknown error";

    const resp = await request.send();
    try {
        const result = resp.bodyJson;
        if (result.data) {
            this.emit("debug", { action: "receiveCcipReadFetchResult", request, result });
            return result.data;
        }
        if (result.message) { errorMessage = result.message; }
        this.emit("debug", { action: "receiveCcipReadFetchError", request, result });
    } catch (error) { }

    // 4xx indicates the result is not present; stop
    assert(resp.statusCode < 400 || resp.statusCode >= 500, `response not found during CCIP fetch: ${ errorMessage }`,
        "OFFCHAIN_FAULT", { reason: "404_MISSING_RESOURCE", transaction: tx, info: { url, errorMessage } });

    // 5xx indicates server issue; try the next url
    errorMessages.push(errorMessage);
}

This file has been truncated.

Edit: I almost forgot the ethers.js error logs. We’ve listed a primary plus 3 IPFS gateways (4 in total) for CCIP read.

ethers.min.js:1 
 GET https://e501017….99f7d5d….eab2c38eedc50…1a8bd0def.ipfs2.eth.limo/.well-known/eth/freetib/contenthash.json?t=0x5824 
net::ERR_HTTP2_PROTOCOL_ERROR

ethers.min.js:1 
 Uncaught (in promise) TypeError: Failed to fetch
    at FetchRequest.getUrl (ethers.min.js:1:15378)
    at #send (ethers.min.js:1:21605)
    ....

raffy · September 29, 2024, 10:27pm

To clarify, you mean status 4XX shouldn’t abort the gateway iteration?
What about 2XX with junk?

I’d go one step further and say termination should be decided by the contract.

OffchainNext.sol lets the contract decide if it wants to accept the response. It does randomized gateway iteration too. It works today w/o any framework modification. It only does 1-of-n but it could be extended to do m-of-n.

I have a pretty cool demo where I have a setup with various endpoints + one good one, and forcibly try all permutations, and the request always succeeds (at the expense of extra latency).

I also have a demo where I use this with an EVMGateway, where it blocks a malicious gateway from supplying it invalid proofs (which would look like a normal, properly formatted 200 response). This illustrates the nuance in the error types:

if 200 but "kek" or {data: "0x"}, that’s an invalid response → next
if 200 with valid looking proofs but they were wrong, that’s an invalid response → next
if 200 with valid proofs that assert you can’t do something or divide by zero, that’s a valid response but an execution error → fail
if 200 with valid proofs and success → success
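
The four cases above can be sketched as a small decision function. This is only an illustration of the distinction between “try the next gateway” and “fail”; the `GatewayReply` fields and the `decide` name are hypothetical and are not how OffchainNext.sol represents a response.

```typescript
type Outcome = "next" | "fail" | "success";

// Hypothetical summary of what the contract learned about one 200 response.
interface GatewayReply {
  wellFormed: boolean;  // body decoded as usable data (not "kek" or {data: "0x"})
  proofsValid: boolean; // the proofs verified against the expected state
  executionOk: boolean; // the proven computation itself did not revert
}

function decide(reply: GatewayReply): Outcome {
  if (!reply.wellFormed) return "next";  // invalid response
  if (!reply.proofsValid) return "next"; // looks fine but proofs are wrong
  if (!reply.executionOk) return "fail"; // valid response, execution error
  return "success";
}
```

The point of the ordering is that only a response the contract has actually verified is allowed to end the iteration, whether with success or with a genuine execution error.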

Additionally, we should also emphasize that {sender} is not necessarily the actual requestor, and that the requesting identity should be encoded into the endpoint URL to indicate the chain and contract (e.g. signing relative to sender is bad practice).

nick.eth · September 30, 2024, 1:01pm

Tagging @ricmoo for Ethers support.

ricmoo · September 30, 2024, 8:44pm

Thanks. Looking into this now.

ricmoo · September 30, 2024, 9:30pm

I think it is likely the same root cause as this issue, if you could check and see if you agree?

Do you have a simple test case I can use to test this and then confirm the fix addresses it?

ricmoo · September 30, 2024, 9:31pm

lol. I should have clicked through the link in the issue first; looks like it references this post. le sigh…

0xc0de4c0ffee · September 30, 2024, 9:37pm

Aborting after a 4xx error gives the first gateway in the list full power to halt the whole lookup process.
E.g., in a randomized gateway list of N = 5, if 1 gateway is throwing 4XX, that’s a 20% failure rate for offchain lookups, as there’s no fallback.

while(request.status !== 200){
    //.. try catch next gateway
}
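
The 20% figure can be checked by enumeration. The sketch below assumes exactly one gateway answers 4xx, the other N-1 answer 200, and the bad gateway is equally likely to land at any position; `lookupSucceeds` and `failureRate` are illustrative names.

```typescript
type Gateway = "ok" | "4xx" | "5xx";

// Walks the list in order: 200 succeeds, 5xx falls through to the next URL,
// and 4xx either aborts (current spec) or falls through (proposed behaviour).
function lookupSucceeds(order: Gateway[], abortOn4xx: boolean): boolean {
  for (const gateway of order) {
    if (gateway === "ok") return true;
    if (gateway === "4xx" && abortOn4xx) return false;
  }
  return false;
}

// Places the single bad gateway at every position and counts failed lookups.
function failureRate(n: number, abortOn4xx: boolean): number {
  let failures = 0;
  for (let badIndex = 0; badIndex < n; badIndex++) {
    const order: Gateway[] = [];
    for (let i = 0; i < n; i++) order.push(i === badIndex ? "4xx" : "ok");
    if (!lookupSucceeds(order, abortOn4xx)) failures++;
  }
  return failures / n;
}
```

With these assumptions the lookup only fails when the bad gateway is tried first, so `failureRate(5, true)` is 1/5, and it drops to 0 once 4xx falls through like 5xx.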

ERC-3668 is strict about that {data:"..."} format. What if the resolver is meant to handle the junk from a bad gateway? That doesn’t break anything in ERC-3668: just do recursive CCIP lookups directly from the resolver, like an m-of-n multisig.

I’m not sure how far we can go without breaking the ERC-3668 spec. It’s cleverly designed to fit in between web2 and web3, so the whole lookup process tries to mimic HTTP, but we want a more failsafe scenario for web3/decentralization…

I’m rethinking the ERC-7700 draft with this ERC-3668 and gateway stuff in mind… I’ll share that ERC-7700 redesign/suggestion here soon. We can’t change anything big in ERC-3668 after 4 years.

raffy · September 30, 2024, 9:58pm

Did you look at OffchainNext and my example? It lets the contract decide w/o any modification–but it can’t circumvent the 4XX ethers issue.

0xc0de4c0ffee · September 30, 2024, 10:33pm

I see you’re using data uri to act like final failsafe gateway…

_shouldTryNext should also be triggered after signature/length check fails.

there’s one more thing to check from ERC3668.

This protocol can result in multiple lookups being requested by the same contract. Clients MUST implement a limit on the number of lookups they permit for a single contract call, and this limit SHOULD be at least 4.

ricmoo · October 1, 2024, 2:41am

FYI. The issue has been fixed in Ethers. See the v6.13.3 notes for details.

Thanks for letting me know.

raffy · October 19, 2024, 2:04am

I would like to reopen this issue for further discussion. I greatly appreciate ricmoo doing a fast update on ethers, and coffee for bringing attention to the problem; however, I think 3668 and the implementations need further changes.

tl;dr in adraffy/ur-poc I implemented (2) things:

a new UniversalResolver (UR.sol) which is clean, has support for resolve(multicall), allows for arbitrary callback recursion, and provides extra information.
a proof-of-concept of a better ERC-3668 algorithm, implemented as an ethers ContractRunner (CCIPReadRunner.ts)

CCIPReadRunner

if enableCcipRead = false → do the normal thing
call the contract function
if it doesn’t revert → return
if reverts but isn’t OffchainLookup → throw
if sender doesn’t match → throw
start with attempts = 20, index = 0
next: while index < endpoints
    1. decrease attempts, if 0 → throw
    2. if connection fails → next
    3. ignore status code: if response isn’t json → next
    4. if json doesn’t have {data: HexString} → next
    5. if data is OffchainLookupUnanswered() → next
    6. call the contract callback with the response
    7. if it doesn’t revert → return
    8. if it reverts OffchainTryNext and sender matches → next
    9. if it reverts OffchainLookup and sender matches, replace endpoints, reset index = 0, next
    10. otherwise, throw
call the contract callback with OffchainLookupUnanswered()
    if it doesn’t revert, return
    if it reverts OffchainLookup and sender matches, replace endpoints, reset index = 0, next
    otherwise throw
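
The steps above can be sketched in TypeScript as follows. This is not the code from CCIPReadRunner.ts: contract calls are modelled as already-decoded `CallOutcome` values, `fetchData` is an assumed helper that folds steps 2 to 4 into a single `undefined` result, and `UNANSWERED` is a placeholder marker rather than the real encoding of OffchainLookupUnanswered(). All names are hypothetical.

```typescript
type Hex = string;

// Placeholder marker, NOT the real ABI encoding of OffchainLookupUnanswered().
const UNANSWERED: Hex = "0xffffffff";

interface Lookup {
  kind: "lookup";
  sender: string;
  urls: string[];
  // Calls the contract callback with a response and decodes what happened.
  callback: (response: Hex) => Promise<CallOutcome>;
}

type CallOutcome =
  | { kind: "ok"; data: Hex }
  | Lookup
  | { kind: "tryNext"; sender: string }
  | { kind: "revert"; reason: string };

// Returns the {data} hex, or undefined if the connection failed, the body
// was not JSON, or it had no {data: HexString} (steps 2-4).
type EndpointFetcher = (url: string) => Promise<Hex | undefined>;

async function tryEndpoints(
  lookup: Lookup,
  target: string,
  fetchData: EndpointFetcher,
  budget: { attempts: number },
): Promise<CallOutcome> {
  for (const url of lookup.urls) {
    budget.attempts -= 1; // step 1, read literally: decrease, then check for 0
    if (budget.attempts === 0) throw new Error("too many attempts");
    const data = await fetchData(url);
    if (data === undefined || data === UNANSWERED) continue; // steps 2-5
    const result = await lookup.callback(data); // step 6
    // Step 8: only a matching OffchainTryNext moves on to the next endpoint.
    if (result.kind === "tryNext" && result.sender === target) continue;
    return result; // steps 7, 9, 10 are resolved by the caller
  }
  // No endpoint was accepted: give the contract the final say.
  return lookup.callback(UNANSWERED);
}

async function runLookup(
  target: string,
  first: CallOutcome,
  fetchData: EndpointFetcher,
  maxAttempts = 20,
): Promise<Hex> {
  const budget = { attempts: maxAttempts }; // shared across nested lookups
  let outcome: CallOutcome = first;
  for (;;) {
    if (outcome.kind === "ok") return outcome.data;
    if (outcome.kind !== "lookup") throw new Error(`call reverted: ${outcome.kind}`);
    if (outcome.sender !== target) throw new Error("sender mismatch");
    outcome = await tryEndpoints(outcome, target, fetchData, budget);
  }
}
```

Note that the HTTP status code never appears: in this model an endpoint is skipped only when it yields no usable `{data}` or when the contract itself rejects the response.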

New additions:

error OffchainTryNext(sender) thrown by the contract if the response is not acceptable
error OffchainLookupUnanswered() supplied to the callback after reaching the last endpoint without success, cannot be called from an endpoint response

This is almost the same as the current algorithm except 4XX is not fatal. Instead, the contract has full control over iteration and can fail gracefully if no endpoint is alive. This algorithm is backwards compatible with ERC-3668.

UR

function resolve(
    bytes memory name,
    bytes[] memory calls
) external view returns (Lookup memory lookup, Response[] memory res);
struct Lookup {
    uint256 offset;
    bytes32 node;
    address resolver;
    bool extended;
}
struct Response {
    uint256 bits;
    bytes data;
}
uint256 constant ERROR_BIT    = 1 << 0; // resolution failed
uint256 constant OFFCHAIN_BIT = 1 << 1; // reverted OffchainLookup
uint256 constant BATCHED_BIT  = 1 << 2; // used Batched Gateway
uint256 constant RESOLVED_BIT = 1 << 3; // resolution finished (internal flag)

does ENSIP-10
allocates res = new Response[](calls.length)
if extended, wraps each call in resolve()
if onchain, sets RESOLVED_BIT, if failed, sets ERROR_BIT
otherwise, sets OFFCHAIN_BIT
if only 1 call is missing, wrap that call
if multiple calls are missing, try resolve(multicall)
if we haven’t reverted yet and missing > 0, use the batched gateway
otherwise, done
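
On the client side, the `bits` field of each `Response` could be unpacked with a helper like the one below. This is a sketch: the flag values mirror the Solidity constants above, while `decodeBits` and `ResponseFlags` are hypothetical names. It takes a plain `number`, assuming the caller has already converted the `uint256` (for example from a `bigint`), which is safe here because only the low four bits are defined.

```typescript
const ERROR_BIT = 1 << 0;    // resolution failed
const OFFCHAIN_BIT = 1 << 1; // reverted OffchainLookup
const BATCHED_BIT = 1 << 2;  // used Batched Gateway
const RESOLVED_BIT = 1 << 3; // resolution finished (internal flag)

interface ResponseFlags {
  error: boolean;
  offchain: boolean;
  batched: boolean;
  resolved: boolean;
}

// Tests each flag with a bitwise AND against the matching constant.
function decodeBits(bits: number): ResponseFlags {
  return {
    error: (bits & ERROR_BIT) !== 0,
    offchain: (bits & OFFCHAIN_BIT) !== 0,
    batched: (bits & BATCHED_BIT) !== 0,
    resolved: (bits & RESOLVED_BIT) !== 0,
  };
}
```

For example, a record that went offchain through the batched gateway and then failed would carry `ERROR_BIT | OFFCHAIN_BIT | BATCHED_BIT`, and `decodeBits` would report `error`, `offchain` and `batched` as true.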

On callback of a single call:

if offchain again, rewrap
otherwise, set RESOLVED_BIT, if failed, set ERROR_BIT, done

On callback of resolve(multicall)

if ok, decode as multicall, done
otherwise, give up and use the batched gateway

On callback of batched gateway

match returned results with those without RESOLVED_BIT
collect those that reverted again on callback
if any reverted, use batched gateway again
otherwise, done

This implementation is very simple. It greatly reduces the need for the batched gateway as more CCIP-Read servers implement resolve(multicall).

Note: This design can also have one more optimization which only uses the batched gateway for unwrappable contracts (those that incorrectly restrict the sender to a single contract, rather than identify the caller via the endpoint.)

UR.test.ts is a test suite for existing ENS names that tests a variety of situations, including hybrid and traditionally unwrappable names (like Coinbase.)
bun run test
fetch.ts is a simple CLI that lets you test the UR with any name: bun run fetch <name> <records...>

bun run fetch raffy.eth
bun run fetch raffy.eth addr
bun run fetch raffy.eth addr addr:60
bun run fetch raffy.eth text:avatar addr:60 chash pubkey

ccip.ts is a demonstration of how the current ERC-3668 algorithm cannot wrap a server that throws 4XX but with the new CCIPReadRunner it can.

Miscellaneous notes:

ExtendedDNSResolver uses the wrong address encoding: causes addr(60) to be zeropadded and addr() to fail or be 0x00..20

0xc0de4c0ffee · October 19, 2024, 11:32am

Will this require a new ERC/ENSIP, or updating ERC-3668 to retry after a 4xx error & bad data:…?
Or is it used as a wrapper, like the current Universal Resolver with its own CCIP proxy?

We couldn’t even update ERC-5559 (Stagnant, Standards Track ERC), which is only used/tested by a few… ERC-7700 is in draft, and we are fully redesigning the current draft to cover all future chain-agnostic/cross-chain storage providers, as our current draft was focused more on the 5559/3668 type only.

~In short, we can wrap the ERC-3668/5559 selector in a 7700 revert to use a failsafe mode like the CCIPReadRunner logic. We can even add a full paranoid mode where the client requests all CCIP gateways and cross-checks the results… or do it as recursive multisig CCIP calls?

A still bigger concern on the CCIP side is centralization and privacy risk… It’s a well-known issue, as I remember reading long posts from Nick(?) and Ricmoo a few years back… A second huge issue: records are currently signed by gateways; they should be signed by the owner or using the owner’s permits.

I’ve seen similar issues in the old Universal Resolver: it was reading the revert data’s length 0x0000…20 (= 32) as an address. It’s already fixed in the new version…

Universal Resolver Bug

We lost count of the small bugs we reported while deep diving.
That was a ~critical bug in the old official Universal Resolver: all CCIP checks pass and addr data is returned, BUT ETH/tokens sent to a reverting sub/domain.eth would end up at the 0x0000…20 address. Back then nobody was deep into universal/wildcard resolution with callback and fallback mode.

raffy · October 19, 2024, 8:09pm

I’m not sure, possibly an update to the spec plus a small client update.

An existing contract interacting with a new client would:

never throw OffchainTryNext(), so no issue.
a 4XX error with valid {data} would trigger the callback; currently this case blows up.
it would almost always blow up if called with OffchainLookupUnanswered(): only pass-through resolvers (I don’t know of any) would accept a 4-byte response. Anything that does abi.decode() or expects data would revert(). Currently, if you get to the end of the endpoints, it blows up.

An existing client encountering a new contract would:

blow up if a new contract reverts OffchainTryNext() in its callback (the current behavior).
not call OffchainLookupUnanswered() at the end (the current behavior).

raffy · October 20, 2024, 3:57am

I improved this a bit more:

URAlwaysBatched.sol is a UR implementation that works with CCIP-Read as-is but still supports resolve(multicall) (by first asking the batched gateway to do resolve(multicall), then breaking it apart if it fails).
protocol.test.ts shows how the new protocol works relative to the current implementation.
- the OffchainTryNext() test survives 4XX, invalid response encodings, and invalid server data

0xc0de4c0ffee · October 22, 2024, 5:05am

ccip.endpoint + "/404",
ccip.endpoint + "/500",
ccip.endpoint + "/wrong",
ccip.endpoint + "/malicious",
ccip.endpoint,

Thanks for the full tests; I’m too lazy to do all that alone.

A simple ERC-3668 change to stop 4xx/bad data from bad gateways halting the whole lookup process:
a) if 4xx, retry with next gateway
b) if 200 but no json OR bad {data:0x/InvalidHex…} format, retry with next gateway
c) any bad {data:0xValidHex…} is auto checked/handled by resolver, shuffle gateways & retry as we’ve no way of knowing which gateway triggered that.

raffy · October 22, 2024, 7:19am

Without breaking the callback, you can achieve that by doing separate single endpoint OffchainLookup()'s since you receive OffchainLookupUnanswered() if it is unsuccessful and can try again with the next. This is the same technique as OffchainNext.sol except it doesn’t require the data-url.

Alternatively, you could data-url the first endpoint with a signal value, indicating that the remaining endpoints should be tried randomly and that the callback will contain extra data. I think that’s also backwards compatible.

0xc0de4c0ffee · October 23, 2024, 4:44am

Let’s do a full check and a possible fix without introducing anything new like OffchainLookupUnanswered() or OffchainTryNext(). End resolvers are always free to use their own extra checks/logic without breaking ERC-3668.

Client Lookup Protocol

A client that supports CCIP read MUST make contract calls using the following process:

1. Set data to the call data to supply to the contract, and to to the address of the contract to call.

2. Call the contract at address to function normally, supplying data as the input data. If the function returns a successful result, return it to the caller and stop.

3. If the function returns an error other than OffchainLookup, return it to the caller in the usual fashion.

4. Otherwise, decode the sender, urls, callData, callbackFunction and extraData arguments from the OffchainLookup error.

5. If the sender field does not match the address of the contract that was called, return an error to the caller and stop.

6. Construct a request URL by replacing sender with the lowercase 0x-prefixed hexadecimal formatted sender parameter, and replacing data with the 0x-prefixed hexadecimal formatted callData parameter. The client may choose which URLs to try in which order, but SHOULD prioritise URLs earlier in the list over those later in the list.

7. Make an HTTP GET request to the request URL.

8. If the response code from step (7) is in the range 400-499, return an error to the caller and stop.

9. If the response code from step (7) is in the range 500-599, go back to step (5) and pick a different URL, or stop if there are no further URLs to try.

10. Otherwise, replace data with an ABI-encoded call to the contract function specified by the 4-byte selector callbackFunction, supplying the data returned from step (7) and extraData from step (4), and return to step (1).

Clients MUST handle HTTP status codes appropriately, employing best practices for error reporting and retries.

Clients MUST handle HTTP 4xx and 5xx error responses that have a content type other than application/json appropriately; they MUST NOT attempt to parse the response body as JSON.

This protocol can result in multiple lookups being requested by the same contract. Clients MUST implement a limit on the number of lookups they permit for a single contract call, and this limit SHOULD be at least 4.

(8) and (9) should be merged, and 4xx should use the same fallback logic as 5xx.
Possibly replace (8) with: if the response JSON doesn’t contain {"data": ...}, or the data value is not valid 0x-prefixed hex, go back to step (5) and pick a different URL, or stop if there are no further URLs to try.
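
A rough TypeScript sketch of that merged rule, to make the proposal concrete. The function names are hypothetical, and treating a bare "0x" as valid hex is an assumption; a stricter client could reject it.

```typescript
// Even number of hex digits after the 0x prefix. Accepting a bare "0x" is an
// assumption of this sketch, not something the proposal states.
function isHexData(value: unknown): value is string {
  return typeof value === "string" && /^0x(?:[0-9a-fA-F]{2})*$/.test(value);
}

// Returns the data field if the body is JSON of the form {"data": "0x..."}.
function extractData(bodyText: string): string | undefined {
  let parsed: unknown;
  try {
    parsed = JSON.parse(bodyText);
  } catch {
    return undefined; // not JSON at all
  }
  if (typeof parsed !== "object" || parsed === null) return undefined;
  const data = (parsed as { data?: unknown }).data;
  return isHexData(data) ? data : undefined;
}

// Merged steps (8) and (9): any 4xx or 5xx, or a body without valid
// {"data": hex}, means "pick a different URL" instead of aborting.
function shouldTryNextUrl(status: number, bodyText: string): boolean {
  if (status >= 400 && status <= 599) return true;
  return extractData(bodyText) === undefined;
}
```

Only a response that passes both checks would reach step (10) and be handed to the contract callback; everything else falls back to the next gateway.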
__
It’s kinda hard to update the current ERC-3668 after all these years with this fix for bad/compromised gateways halting the whole lookup process with a 4xx error or bad data:InvalidHex. I’m not sure if it’s possible to update old ERCs with this critical fix?
cc @nick.eth