web3-storage / web3.storage

DEPRECATED ⁂ The simple file storage service for IPFS & Filecoin
https://web3.storage
Other

Retrieving files via the Node SDK fails if the file was uploaded recently #1810

Closed Dygnify closed 1 year ago

Dygnify commented 1 year ago

When using the 'web3.storage' JavaScript client, I sometimes get the error "Error: block with cid bafybeielftk56bjhtftrztt3pksclpol5vmhv6bkqx44zo3nlhz6tah4wa no found", but when I check the web3.storage portal, this CID exists.

Detailed exception:

Error: block with cid bafybeielftk56bjhtftrztt3pksclpol5vmhv6bkqx44zo3nlhz6tah4wa no found
    at MemoryBlockStore.get (memory.js:20:1)
    at VerifyingGetOnlyBlockStore.get (verifying-get-only-blockstore.js:10:1)
    at unixFsResolver (index.js:25:1)
    at resolve (index.js:21:1)
    at walkPath (index.js:40:1)
    at walkPath.next ()
    at last (index.js:13:1)
    at exporter (index.js:57:1)
    at recursive (index.js:64:1)
    at recursive.next ()

Code to retrieve the file using the CID:

import { Web3Storage, File } from "web3.storage";

function makeStorageClient() {
  return new Web3Storage({ token: process.env.REACT_APP_WEB3STORAGE_APIKEY });
}

async function getFile(cid) {
  const client = makeStorageClient();
  const res = await client.get(cid);
  const files = await res.files();
  return files;
}

When debugging, it fails at "await res.files();".

Also, all our uploaded files have been queuing since last month.

vasco-santos commented 1 year ago

Hi @Dygnify

There is sometimes a delay between a write to the storage provider and a read provider being able to fulfil the request. When a write is made, the CAR file is indexed, the CIDs of each block are then provided to the network, and finally IPFS nodes have information about which peer is storing a given block CID.

We are working on improvements to our reads pipeline that "connect" it to the writes pipeline. Once that is done, we should be able to offer faster reads after an upload happens, but there will always be a bit of delay moving the data around.

Let me know if you have any questions in the meantime
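
Given the propagation delay described above, one possible client-side workaround is to retry the retrieval with an increasing wait between attempts. The following is a minimal sketch, not part of the web3.storage client: `retryWithBackoff` and `getFilesWhenAvailable` are hypothetical helper names, and the retry count and delays are arbitrary assumptions.

```javascript
// Hypothetical helper (not part of the web3.storage client):
// calls `fn` until it succeeds or the retry budget is used up.
async function retryWithBackoff(fn, { retries = 5, baseDelayMs = 1000 } = {}) {
  let lastError
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      if (attempt === retries) break
      // Wait 1x, 2x, 4x, ... the base delay before the next attempt.
      await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt))
    }
  }
  throw lastError
}

// Usage sketch: `client` is a Web3Storage instance as created in the report above.
async function getFilesWhenAvailable(client, cid) {
  return retryWithBackoff(async () => {
    const res = await client.get(cid)
    if (!res.ok) throw new Error(`failed to get ${cid}: ${res.status}`)
    return res.files()
  })
}
```

Retrying only helps if each attempt actually reaches the service, so a cached incomplete response could defeat it.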

Dygnify commented 1 year ago

@vasco-santos thanks for the explanation. But my question is: if this same file can be accessed via the HTTPS gateway, why can't it be available via the SDK? After a file upload, if I try to access the file via 'https://.ipfs.w3s.link' immediately, it works fine, but the same request via the SDK gives the error that the block with the CID is "no found".

If it works via the HTTPS gateway, it should also work via the web3.storage SDK.

codeicey commented 1 year ago

I ran into the same error. I need a way to automate retrieving and displaying those images.

vasco-santos commented 1 year ago

@Dygnify @madhukar123456 do you mean that you can access the file via w3link but not when using the web3.storage client? That is unexpected. If so, we need to look into what is happening.

codeicey commented 1 year ago

> @Dygnify @madhukar123456 do you mean that you can access the file via w3link but not when using the web3.storage client? That is unexpected. If so, we need to look into what is happening.

Yeah, res.files() doesn't work. I've been using the IPFS link method to view images.

Dygnify commented 1 year ago

Yes, correct, res.files() doesn't work.

zhanleewo commented 1 year ago

Hi @vasco-santos I'd like to give you a detailed bug report.

Environment

Chrome for Mac

Steps to reproduce

Upload and download multiple times

Frequency

Unpredictable (Looks like it's related to the load of the Web3 Storage Service)

Description

We came across similar issues in our application as well. Each client in our application would typically create a message (as a file) first in their own Web3.Storage space and then deliver the message identifier (CID) to one of their contacts. The intended recipient of the message identifier usually makes a request for the message shortly after receiving the notification.

When the recipient retrieves the message using the official JavaScript client library, we frequently encounter the “cid xxx not found” error. Most retrieval requests are issued after the sender sends the message notification and before the upload enters the “Stored” state. It can take many days to get from “Queuing” to “Stored”. Retrieval appears to be more stable once the upload is in the “Stored” state.

[Screenshot: CID Not Found]

Using Chrome's DevTools to monitor network traffic while accessing a file stored in Web3 Storage with the official JavaScript client library, only 278 B was retrieved, whereas using a public IPFS gateway directly, the whole 1.1 KB message can be retrieved completely.

[Screenshot: Small Size]

In more detail: in Chrome, the response contains only the CAR header, without the content that should follow.

[Screenshot: CID Header Only]

In your previous reply:

> Hi @Dygnify
>
> There is sometimes a delay between a write to the storage provider and a read provider being able to fulfil the request. When a write is made, the CAR file is indexed, the CIDs of each block are then provided to the network, and finally IPFS nodes have information about which peer is storing a given block CID.
>
> We are working on improvements to our reads pipeline that "connect" it to the writes pipeline. Once that is done, we should be able to offer faster reads after an upload happens, but there will always be a bit of delay moving the data around.
>
> Let me know if you have any questions in the meantime

As you said, it takes time to traverse all the steps in order to distribute the file into their appropriate IPFS nodes. https://github.com/web3-storage/web3.storage/blob/main/packages/api/src/car.js#L145

  const tasks = [async () => {
    try {
      await pinToCluster(sourceCid, env)
    } catch (err) {
      console.warn('failed to pin to cluster', err)
    }

    const pins = await addToCluster(car, env)

    await env.db.upsertPins(pins.map(p => ({
      status: p.status,
      contentCid,
      location: p.location
    })))

    // Retrieve current pin status and info about the nodes pinning the content.
    // Keep querying Cluster until one of the nodes reports something other than
    // Unpinned i.e. PinQueued or Pinning or Pinned.
    if (!pins.some(p => PIN_OK_STATUS.includes(p.status))) {
      await waitAndUpdateOkPins(contentCid, env.cluster, env.db)
    }
  }]

When the error/exception occurs, from the JavaScript client library: https://github.com/web3-storage/ipfs-car/blob/main/src/unpack/index.ts#L26

export async function* unpackStream(readable: ReadableStream<Uint8Array> | AsyncIterable<Uint8Array>, { roots, blockstore: userBlockstore }: { roots?: CID[], blockstore?: Blockstore } = {}): AsyncIterable<UnixFSEntry> {
  const carIterator = await CarBlockIterator.fromIterable(asAsyncIterable(readable))
  const blockstore = userBlockstore || new MemoryBlockStore()

  for await (const block of carIterator) {
    await blockstore.put(block.cid, block.bytes)
  }

  const verifyingBlockStore = VerifyingGetOnlyBlockStore.fromBlockstore(blockstore)

  if (!roots || roots.length === 0 ) {
    roots = await carIterator.getRoots()
  }

  for (const root of roots) {
    yield* unixFsExporter(root, verifyingBlockStore)
  }
}

Since the data in the readable is incomplete, the blockstore passed into unixFsExporter contains no relevant data. This is consistent with your explanation. What is perplexing, however, is that the document for that CID is retrievable from any usable IPFS gateway, just not from your API. This is strange because, as the code below shows, the API does nothing other than forward the request to the IPFS cluster gateway.

NOTE: Once an incomplete CAR is returned, retrieval through the JavaScript client library keeps failing until the browser cache expires, even if the public IPFS gateway can already serve the content. I suggest that the browser should not cache the response in this situation.
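
As an illustration of the suggestion above, a client could ask the browser not to use its HTTP cache by passing the standard fetch option `cache: 'no-store'`. This is only a sketch under assumptions: `fetchCarUncached` is a hypothetical helper, the `/car/<cid>` path and the bearer token header are assumed from the API route and client usage shown in this thread, and `fetchImpl` is injectable purely for testing. It does not affect any server-side cache.

```javascript
// Hypothetical helper: requests the CAR for a CID while bypassing the browser HTTP cache.
// The endpoint path and Authorization header are assumptions for illustration.
async function fetchCarUncached(cid, token, { endpoint = 'https://api.web3.storage', fetchImpl = fetch } = {}) {
  const url = new URL(`/car/${cid}`, endpoint)
  return fetchImpl(url.toString(), {
    headers: { Authorization: `Bearer ${token}` },
    // Standard fetch option: do not read from or write to the HTTP cache.
    cache: 'no-store'
  })
}
```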

https://github.com/web3-storage/web3.storage/blob/main/packages/api/src/car.js#L49

export async function carGet (request, env, ctx) {
  const cache = caches.default
  let res = await cache.match(request)

  if (res) {
    return res
  }

  const {
    params: { cid }
  } = request
  // gateway does not support `carversion` yet.
  // using it now means we can skip the cache if it is supported in the future
  const url = new URL(`/api/v0/dag/export?arg=${cid}&carversion=1`, env.GATEWAY_URL)
  res = await fetch(url.toString(), { method: 'POST' })
  if (!res.ok) {
    // bail early. dont cache errors.
    return res
  }
  // Clone the response so that it's no longer immutable. Ditch the original headers.
  // Note: keeping the original headers seems to prevent the carHead function from setting Content-Length
  res = new Response(res.body)
  res.headers.set('Content-Type', 'application/vnd.ipld.car')
  // cache for 1 year, the max max-age value.
  res.headers.set('Cache-Control', 'public, max-age=31536000')
  // without the content-disposition, firefox describes them as DMS files.
  res.headers.set('Content-Disposition', `attachment; filename="${cid}.car"`)
  // always https pls.
  res.headers.set('Strict-Transport-Security', 'max-age=31536000; includeSubDomains; preload"')
  // // compress if asked for? is it worth it?
  // if (request.headers.get('Accept-Encoding').match('gzip')) {
  //   headers['Content-Encoding'] = 'gzip'
  // }
  ctx.waitUntil(cache.put(request, res.clone()))
  return res
}

[Screenshot: CID Header Only]

Additionally, when testing “retrieve file”, we frequently come across two other problems:

  1. 504 Gateway Timeout or 524 A Timeout Occurred
  2. When retrieving large files (about 300 MB, not excessively large), we often see no further progress even after waiting for hours.

[Screenshot: 504 Gateway Timeout]

Assertion

We speculate that there are bugs in your private IPFS cluster or your private IPFS cluster may be short on resources.

Dygnify commented 1 year ago

I've also faced the 504 Gateway Timeout for large files, and on the HTTPS gateway as well.

vasco-santos commented 1 year ago

Hey folks!

Currently only the ipfs.io gateway supports format=car export, and that is what the web3.storage JS client currently relies on. We will soon land a new system that also supports it, and add upstream support for it in w3link. Once we have that, this should improve considerably.

Regarding the use cases in this issue, do folks intend to fetch CAR files and unpack them in the client? If not, you can just fetch the response from the w3link gateway directly for now. We can also add support in the SDK for getting content directly instead of as a CAR file.

zhanleewo commented 1 year ago

> Hey folks!
>
> Currently only the ipfs.io gateway supports format=car export, and that is what the web3.storage JS client currently relies on. We will soon land a new system that also supports it, and add upstream support for it in w3link. Once we have that, this should improve considerably.
>
> Regarding the use cases in this issue, do folks intend to fetch CAR files and unpack them in the client? If not, you can just fetch the response from the w3link gateway directly for now. We can also add support in the SDK for getting content directly instead of as a CAR file.

When the "Error: block with cid xxxxxxxx not found" error occurs in the official JavaScript client library, I try to retrieve the file contained in the CAR directly from other gateways by file name, like https://ipfs.io/ipfs/<cid>/<filename>.

Even when I fetch files from other gateways, I can still get 504 Gateway Timeout or 524 A Timeout Occurred.

vasco-santos commented 1 year ago

@zhanleewo the SDK retrieves the file in CAR format and unpacks the CAR within the client. The public gateways, on the other hand, give you the exported UnixFS format (not CAR) unless CAR is explicitly requested (via a header, or a query parameter in gateways that support it).

If you are also having issues with ipfs.io or w3s.link, it is likely that the upload did not complete successfully. Can you give me the CID so that I can try to diagnose on my end?
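
To make the difference between the two response formats concrete, here is a small sketch of how a gateway request might be built for each. `gatewayRequest` is a hypothetical helper. The `format=car` query parameter and the `application/vnd.ipld.car` media type are the ones mentioned in this thread, and whether a particular gateway honours them is an assumption to verify.

```javascript
// Hypothetical helper: builds the URL and headers for a gateway request.
// With asCar = false the gateway returns the exported file content;
// with asCar = true a CAR is requested explicitly.
function gatewayRequest(cid, { gateway = 'https://ipfs.io', asCar = false } = {}) {
  const url = new URL(`/ipfs/${cid}`, gateway)
  const headers = {}
  if (asCar) {
    url.searchParams.set('format', 'car')
    headers.Accept = 'application/vnd.ipld.car'
  }
  return { url: url.toString(), headers }
}

// Usage sketch:
// const { url, headers } = gatewayRequest(cid, { asCar: true })
// const response = await fetch(url, { headers })
```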

Dygnify commented 1 year ago

@vasco-santos I really appreciate the response on this, but our requirement is to upload a file to IPFS and then, within a few seconds or so, retrieve it to perform some other flow. I gather from your responses that immediate retrieval of files after upload to IPFS is currently not supported and that work is under way to improve this. May I know how long it will take and how soon it will be available in the SDK?

zhanleewo commented 1 year ago

> @zhanleewo the SDK retrieves the file in CAR format and unpacks the CAR within the client. The public gateways, on the other hand, give you the exported UnixFS format (not CAR) unless CAR is explicitly requested (via a header, or a query parameter in gateways that support it).
>
> If you are also having issues with ipfs.io or w3s.link, it is likely that the upload did not complete successfully. Can you give me the CID so that I can try to diagnose on my end?

@vasco-santos Thanks for your response on this. The CAR files corresponding to the following CIDs cannot be retrieved normally:

bafybeicuccxffegr7to5q5j5z3lwdo6tczr6hjw4ffwhiuoxaustmscu5e
bafybeidzqcpnlgyfjzxfxeq4vf7uie6m56dzvekyfx6zisih67gdojaoaq

After reporting our frequent difficulties in uploading our files into web3.storage, we have done more tests trying to isolate the scenarios under which the uploading abnormalities typically occur.

The latest test scenario is quite straightforward: uploading 5 large files, each around 200 MB, one after another. The uploads are usually successful and uneventful. After the successful uploads, we tried to download those 5 files.

The downloads usually end in partial success. Some files download successfully, but almost always, for some downloads the speed drops to 0 after a while and stays there.

Repeating the same test scenario multiple times, we found:

  1. The downloads that failed to complete are all for files in the Pinning state that have not moved to the Pinned state.
  2. Some files move from Pinning to Pinned fairly quickly. Others take several days without reaching the Pinned state.

The following two CIDs are those troublesome files that failed to get into the Pinned state:

  1. bafybeicuccxffegr7to5q5j5z3lwdo6tczr6hjw4ffwhiuoxaustmscu5e
  2. bafybeidzqcpnlgyfjzxfxeq4vf7uie6m56dzvekyfx6zisih67gdojaoaq

We suspect they may stay in the Pinning state for a while (days).
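
One way to observe the Pinning to Pinned transition described above from a script is to poll the upload status. This is a sketch under assumptions: it assumes the client exposes `status(cid)` returning an object with a `pins` array whose entries carry a `status` string such as "Pinned", and `waitUntilPinned` is a hypothetical helper with arbitrary polling settings.

```javascript
// Hypothetical helper: polls the status of a CID until at least one pin reports "Pinned".
// Assumes client.status(cid) resolves to { pins: [{ status: '...' }, ...] } or undefined.
async function waitUntilPinned(client, cid, { intervalMs = 30000, maxChecks = 20 } = {}) {
  for (let check = 0; check < maxChecks; check++) {
    const status = await client.status(cid)
    const pins = status && status.pins ? status.pins : []
    if (pins.some(pin => pin.status === 'Pinned')) return true
    // Do not sleep after the final check.
    if (check < maxChecks - 1) {
      await new Promise(resolve => setTimeout(resolve, intervalMs))
    }
  }
  return false
}
```

A `false` result after the last check would correspond to the stuck uploads reported here.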

Hopefully this report can help you identify the problems. We have been struggling with these problems for weeks now. Please let us know what more we can do to help resolve them.

The attached zip package (put-files-one-by-one.zip) contains a test file that recreates the problem reported earlier. The test program is a modified version of https://github.com/web3-storage/web3.storage/blob/main/packages/client/examples/node.js/put-files-from-fs.js

import { randomBytes } from 'crypto'
import process from 'process'
import minimist from 'minimist'
import { Web3Storage, File } from 'web3.storage'

async function randomFile() {
  const buf = randomBytes(1024 * 1024 * 200);
  const file = new File([buf], 'Random-File-' + (new Date().getTime()) +'.bin')
  return file;
}

async function main () {
  const args = minimist(process.argv.slice(2))
  const token = args.token
  const n = args.n || 5;

  if (!token) {
    console.error('A token is needed. You can create one on https://web3.storage')
    return
  }

  const storage = new Web3Storage({ token })

  for (let i=0; i<n; i++) {
    const file = await randomFile();
    console.log('Putting file "' + file.name + '" into Web3Storage...');
    const cid = await storage.put([file], {
      onRootCidReady: (cid) => {
        console.log('CID of the file: ', cid);
      },
      onStoredChunk: (size) => {
        console.log('Upload Progress: ' + size + ' of ' + file.size);
      },
      name: file.name
    })
    console.log('Content added with CID:', cid);
  }
}

main()
vasco-santos commented 1 year ago

@Dygnify

> I really appreciate the response on this, but our requirement is to upload a file to IPFS and then, within a few seconds or so, retrieve it to perform some other flow. I gather from your responses that immediate retrieval of files after upload to IPFS is currently not supported and that work is under way to improve this. May I know how long it will take and how soon it will be available in the SDK?

It should be supported when hitting the gateway directly. The SDK works with CAR files. If, instead of using the SDK, you are able to use const response = await fetch(`https://w3s.link/ipfs/${cid}`), it should work.
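
A slightly fuller sketch of the direct gateway fetch suggested above is shown below. `fetchFromGateway` is a hypothetical helper, and `fetchImpl` is injectable only so the function can be exercised without network access. The optional `filename` covers uploads that were wrapped in a directory, which are addressed as `<cid>/<filename>` as in the ipfs.io example earlier in this thread.

```javascript
// Hypothetical helper: fetches content for a CID from the w3s.link gateway and
// returns its bytes. Pass `filename` when the CID is a directory wrapping the file.
async function fetchFromGateway(cid, { filename = '', fetchImpl = fetch } = {}) {
  const suffix = filename ? `/${encodeURIComponent(filename)}` : ''
  const url = `https://w3s.link/ipfs/${cid}${suffix}`
  const response = await fetchImpl(url)
  if (!response.ok) {
    throw new Error(`gateway returned ${response.status} for ${url}`)
  }
  return new Uint8Array(await response.arrayBuffer())
}
```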

If your requirement is to work with CAR files on retrieval, we are working on improving our gateway to better support that. The current SDK relies on dweb.link, which does not look very stable. Once w3s.link lands support for CAR retrieval and we change the SDK to use w3s.link, this should work nicely.

vasco-santos commented 1 year ago

Thanks for all the pointers @zhanleewo. Looking into the provided CIDs.

vasco-santos commented 1 year ago

@zhanleewo looking into the provided CID: bafybeicuccxffegr7to5q5j5z3lwdo6tczr6hjw4ffwhiuoxaustmscu5e

I can see in our DB that we have an expected upload size of 279534847 bytes, per the computed DAG size. However, I can also see that we only received one chunk of that upload, of ~9 MB. This explains why the given CAR is not retrievable and why its state is Pinning forever.

It is probably worth creating a new issue, given that this is related to how the file is being uploaded. In the meantime, can you try the code above with:

const file = new File(buf.buffer, 'Random-File-' + (new Date().getTime()) +'.bin')

As you can see in the Node.js console:

> const b = Buffer.from('hello world')
undefined
> b
<Buffer 68 65 6c 6c 6f 20 77 6f 72 6c 64>
> [b]
[ <Buffer 68 65 6c 6c 6f 20 77 6f 72 6c 64> ]
> b.buffer
ArrayBuffer {
  [Uint8Contents]: <2f 00 00 00 00 00 00 00 2f 00 00 00 00 00 00 00 68 65 6c 6c 6f 20 77 6f 72 6c 64 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ... 8092 more bytes>,
  byteLength: 8192
}
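
The console session above shows that the `buffer` property of an 11-byte Node.js Buffer can be a much larger backing ArrayBuffer (8192 bytes here). As a minimal sketch, the hypothetical helper `exactArrayBuffer` below copies out only the bytes that belong to the Buffer, using its `byteOffset` and `byteLength`.

```javascript
// Hypothetical helper: returns an ArrayBuffer holding exactly the bytes viewed by
// a Node.js Buffer, without the rest of any shared backing pool.
function exactArrayBuffer(buf) {
  return buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength)
}

const b = Buffer.from('hello world')
console.log(b.length)                        // 11
console.log(b.buffer.byteLength >= b.length) // true
console.log(exactArrayBuffer(b).byteLength)  // 11
```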
Dygnify commented 1 year ago

@vasco-santos there is an issue with fetching a CID directly using the w3s link: on some computers it doesn't work, see https://github.com/web3-storage/web3.storage/issues/2109

We tried the remedies mentioned in that issue, but the w3s link is still not working, so I think the SDK is the only other option to make this work properly and smoothly. Any idea or ETA on when the SDK will provide an immediate response for newly uploaded files?

vasco-santos commented 1 year ago

These are unrelated issues @Dygnify. The issue you mentioned is likely related to ISPs blocking w3s.link. The SDK currently uses the dweb.link gateway, and when we port the mentioned features to the SDK it will also rely on w3s.link.

In the meantime, the only thing you can do is fall back to dweb.link if w3s.link fails (you will lose performance...). The real solution for the problem in the linked issue is to contact the ISPs blocking w3s.link and ask them to block only malicious content via subdomain blocking of ${cid}.ipfs.w3s.link, rather than blocking the entire domain because of malware that other parties put on the network.
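
The fallback suggested above could look roughly like the sketch below. `fetchWithGatewayFallback` is a hypothetical helper. The gateway order (w3s.link first, then dweb.link) follows the comment above, and `fetchImpl` is injectable only for testing.

```javascript
// Hypothetical helper: tries each gateway in order and returns the first successful response.
async function fetchWithGatewayFallback(cid, {
  gateways = ['https://w3s.link', 'https://dweb.link'],
  fetchImpl = fetch
} = {}) {
  const errors = []
  for (const gateway of gateways) {
    const url = `${gateway}/ipfs/${cid}`
    try {
      const response = await fetchImpl(url)
      if (response.ok) return response
      errors.push(`${url}: HTTP ${response.status}`)
    } catch (err) {
      // Network-level failures (for example a blocked domain) end up here.
      errors.push(`${url}: ${err.message}`)
    }
  }
  throw new Error(`all gateways failed:\n${errors.join('\n')}`)
}
```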

Dygnify commented 1 year ago

@vasco-santos you are thinking from a developer's perspective. When end customers face this issue, how can we track each user's ISP and tell them not to block the w3s.link domain? This is unrealistic.

Don't you think there should be a proper resolution for this?

vasco-santos commented 1 year ago

@Dygnify you can read more about this in our blog post https://blog.nft.storage/posts/2022-04-29-gateways-and-gatekeepers. A proper resolution depends on hundreds of third parties that we don't control. We can contact them once we know who flags the domain.

This is ongoing work, not only for w3s.link but for all gateways that serve user content they do not control. We appreciate your help in reporting so that we can offer this service with great reliability.

jjranalli commented 1 year ago

Any solution in sight for this? I find myself unable to retrieve any recently uploaded file (regardless of size, even a few KB). As of now it's hard for user-facing apps to rely on web3.storage.

vasco-santos commented 1 year ago

@jjranalli did you try using the gateway as suggested above? Any reason for using the SDK?