webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://webrecorder.net/browsertrix
GNU Affero General Public License v3.0

[Bug]: Error creating WACZ #2095

prestonvanloon commented 1 month ago

Browsertrix Version

v1.11.7-7a61568

What did you expect to happen? What happened instead?

I am having some DNS issues, probably from resource exhaustion. (I also filed #2094 to allow setting cpu_limits on the crawler.)

Error: getaddrinfo EAI_AGAIN my-minio-domain
    at GetAddrInfoReqWrap.onlookupall [as oncomplete] (node:dns:120:26)

When I see this error, the entire crawl is lost, which is frustrating when the crawl has run for 24 hours. I wish the WACZ upload were retried until it eventually completes or some retry limit is reached.
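
To illustrate, here is a minimal sketch of the kind of retry loop I have in mind; uploadWacz and the limits are hypothetical placeholders, not the crawler's actual upload code:

// Hypothetical sketch: retry a WACZ upload with exponential backoff.
async function uploadWithRetry(
  uploadWacz: () => Promise<void>,
  maxAttempts = 5,
  baseDelayMs = 10_000,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await uploadWacz();
      return; // success
    } catch (err) {
      if (attempt === maxAttempts) throw err; // give up; the crawl data is lost
      // transient failures like EAI_AGAIN often clear up; wait and retry
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** (attempt - 1)),
      );
    }
  }
}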

Reproduction instructions

Not sure. I'm using kind 0.24.0. The cluster config is standard; it just opens the NodePort.

kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30870
    hostPort: 8080
    listenAddress: "0.0.0.0"
    protocol: TCP

I'm using an external minio s3 instance. The minio s3 instance has to be behind HTTPS for replays to work, so I cannot point it at a raw IP address.

Screenshots / Video

No response

Environment

No response

Additional details

I've tried every workaround that I could imagine.

prestonvanloon commented 1 month ago

https://github.com/webrecorder/browsertrix/issues/1137 might be a solution, if that feature request were implemented.

prestonvanloon commented 1 month ago

After a bit of reverse engineering, I found an undocumented s3 field, access_endpoint_url. First I set endpoint_url to http://$IP:$PORT so that no DNS resolution is needed when uploading WACZ files. Of course, that alone is incompatible with replays, since the endpoint is not HTTPS and it's not feasible to obtain an SSL cert for a non-public $IP:$PORT. Setting access_endpoint_url to the domain name https://domain:port/bucket on top of that fixes replays, and this is a sufficient workaround for me.
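
For reference, the relevant storages entry in my helm values ends up looking roughly like this (credentials, IP, port, and bucket name are placeholders, and exact keys may vary by chart version):

storages:
  - name: default
    type: s3
    access_key: "<access-key>"
    secret_key: "<secret-key>"
    bucket_name: btrix-data
    # uploads go straight to the IP, so no DNS resolution is required
    endpoint_url: "http://<IP>:<PORT>/"
    # HTTPS endpoint handed to the browser for replay
    access_endpoint_url: "https://my-minio-domain:<PORT>/btrix-data/"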

I think there should be more than one attempt to upload the WACZ, and if the upload ultimately fails, the rest of the crawl should be aborted, since the crawl data is lost anyway.

ikreymer commented 1 month ago

Yes, access_endpoint_url is designed for something like this. It is odd, though, that the minio instance is not being found while the crawler is able to run.

Re: the DNS issue, I'd be surprised if it's anything related to resource exhaustion - the upload generally happens when the browser is already shut down. Can the crawler resolve the DNS name when it starts running? You can exec into the crawler and see if it can reach the minio node. What we should probably do is check that the upload endpoint is available when starting the crawl and fail immediately if it is not - we'll probably add this (in the crawler repo).
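
For example, something like this (pod name and namespace are placeholders, assuming nslookup is available in the crawler image):

kubectl exec -it <crawler-pod> -n <namespace> -- nslookup my-minio-domain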

I believe the crawler pod should already be retrying a few times, so the upload is retried automatically - most likely the DNS issue simply persists, so the retries keep failing.