webrecorder / browsertrix-crawler

Run a high-fidelity browser-based web archiving crawler in a single Docker container
https://crawler.docs.browsertrix.com
GNU Affero General Public License v3.0
659 stars · 83 forks

make browsertrix-crawler runnable in serverless environments #448

Open msramalho opened 11 months ago

msramalho commented 11 months ago

Hi all,

I've been experimenting with making an AWS lambda function for browsertrix-crawler and I've gone some distance but hit a snag that the maintainers are probably better equipped to help with.

The problem is: the AWS Lambda environment (I'm guessing other serverless options are similar) is locked down so that the only writable directory is /tmp. For browsertrix-crawler's own outputs the --cwd option should solve that, but something is still trying to write to /.local (maybe that's playwright/redis or some other dependency?).

So the current error I get is:

mkdir: cannot create directory ‘/.local’: Read-only file system
touch: cannot touch '/.local/share/applications/mimeapps.list': No such file or directory
/usr/bin/google-chrome: line 45: /dev/fd/63: No such file or directory
/usr/bin/google-chrome: line 46: /dev/fd/63: No such file or directory
{
    "logLevel": "warn",
    "context": "redis",
    "message": "ioredis error",
    "details": {
        "error": "[ioredis] Unhandled error event:"
    }
}
{
    "logLevel": "warn",
    "context": "state",
    "message": "Waiting for redis at redis://localhost:6379/0",
    "details": {}
}
{
    "logLevel": "error",
    "context": "general",
    "message": "Crawl failed",
    "details": {
        "type": "exception",
        "message": "Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!",
        "stack": "TimeoutError: Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!\n    at ChromeLauncher.launch (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/ProductLauncher.js:123:23)\n    at async Browser._init (file:///app/util/browser.js:236:20)\n    at async Browser.launch (file:///app/util/browser.js:61:5)\n    at async Crawler.crawl (file:///app/crawler.js:821:5)\n    at async Crawler.run (file:///app/crawler.js:311:7)"
    }
}

and this is the version info

{
    "logLevel": "info",
    "context": "general",
    "message": "Browsertrix-Crawler 0.11.2 (with warcio.js 1.6.2 pywb 2.7.4)",
    "details": {}
}

I've put the Dockerfile and lambda_function.py in this gist; you can use them if you want to replicate the issue.

For reference, I'm following these instructions: https://docs.aws.amazon.com/lambda/latest/dg/python-image.html, and I'm using API Gateway to make testing quick.
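One thing I may try in the meantime: since Lambda only allows writes under /tmp, the handler could redirect HOME and the XDG base directories there before launching the crawler, so the /.local writes land somewhere writable. A rough sketch (untested on Lambda, and the commented-out crawl invocation is a placeholder, not the real lambda_function.py):

```python
import os
import subprocess

# Sketch, not verified on Lambda: point HOME and the XDG base dirs at /tmp
# (the only writable path) so writes to ~/.local etc. can succeed.
env = dict(os.environ)
env["HOME"] = "/tmp"
env["XDG_DATA_HOME"] = "/tmp/.local/share"
env["XDG_CONFIG_HOME"] = "/tmp/.config"

# Pre-create the directory the chrome wrapper script tries to touch.
os.makedirs("/tmp/.local/share/applications", exist_ok=True)

# Placeholder invocation; the real handler would run the crawler here:
# subprocess.run(["crawl", "--url", "https://example.com/", "--cwd", "/tmp"],
#                env=env, check=True)
```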

tw4l commented 11 months ago

Thanks for flagging this!

mkdir: cannot create directory ‘/.local’: Read-only file system
touch: cannot touch '/.local/share/applications/mimeapps.list': No such file or directory
/usr/bin/google-chrome: line 45: /dev/fd/63: No such file or directory
/usr/bin/google-chrome: line 46: /dev/fd/63: No such file or directory

Hm, I believe that these errors are from the browser itself, not necessarily Puppeteer. From some quick looking around, it looks like Chromium/Chrome/Brave may need to be built in a slightly different way to be able to run on AWS Lambda. We could probably accomplish this by having a separate browser base for Lambda, or perhaps the changes necessary could just be folded into the main release.

msramalho commented 11 months ago

Thanks, it makes sense that it's chrome accessing those dirs.

In that case, a separate base would be the ideal scenario.

Is it possible that all the changes needed could be accommodated by chrome flags, which we can already configure with CHROME_FLAGS as described in the README?

This (2-year-old) Medium post points to a set of flags needed for chrome to run in Lambda:

```js
const chromeFlags = ['--no-xshm', '--disable-dev-shm-usage', '--single-process',
  '--no-sandbox', '--no-first-run', `--load-extension=${extensionDir}`]

// and then actually just
'--no-first-run'
```

I'm trying to gauge whether it's worth testing that, or if it's a dead end.
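For anyone testing this, a sketch of wiring those flags through the CHROME_FLAGS environment variable from Python (I dropped `--load-extension` since we have no extension dir here; untested on Lambda):

```python
import os

# Flags from the Medium post (minus --load-extension), joined into the
# CHROME_FLAGS env var that the crawler README documents.
chrome_flags = [
    "--no-xshm",
    "--disable-dev-shm-usage",
    "--single-process",
    "--no-sandbox",
    "--no-first-run",
]
os.environ["CHROME_FLAGS"] = " ".join(chrome_flags)

# A container started with this environment (e.g. docker run -e CHROME_FLAGS)
# would then pass these flags through to the browser.
```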

tw4l commented 11 months ago

Is it possible that all the changes needed could be accommodated by chrome flags, which we can already configure with CHROME_FLAGS as described in the README?

It is possible! Tbh I'd have to dig deeper into it myself to say either way. It's also worth noting that current releases of the crawler are built on Brave Browser (see #189 for rationale), though it's still possible to build the crawler on Chrome/Chromium via the older debs in the https://github.com/webrecorder/browsertrix-browser-base repo.

If you're willing to put some time into investigating this I'd be happy to help/review a PR!

kema-dev commented 7 months ago

Hello @msramalho, have you been able to run this in Lambda? I'm considering a similar setup.

msramalho commented 7 months ago

Hey @kema-dev, no updates from my side, but I'm still eager to see how this progresses. Several changes have been made to the project since then, and I wonder if any of them (e.g. the changes to the browser base) make this issue easier to solve.

kema-dev commented 7 months ago

Hey, I tried for a bit but didn't achieve a reasonable result. I switched to ECS + Fargate + EFS and had no problems with that method.

msramalho commented 7 months ago

Cool! Care to share any configurations or tips for replication?

kema-dev commented 7 months ago

Sure!

ikreymer commented 7 months ago

@kema-dev Thanks for sharing this! If there's a format that would make the most sense to specify this in (Terraform? an Ansible playbook?), or just as docs, we'd be happy to integrate this into the repo and/or our docs!

kema-dev commented 7 months ago

I personally use Pulumi, but it uses TF providers as backends anyway. Those resources are just AWS services that need to be provisioned; doing it via the Console, Ansible, TF, or Pulumi works the same way.

I'm designing a complete solution with EventBridge as the scheduler plus the ECS setup I described above. Anyway, the core of the solution is in my previous message!
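For anyone who lands here later, the ECS + Fargate + EFS shape described above can be sketched as an ECS task definition, built here as a plain dict (every ID, name, and size is a placeholder, not kema-dev's actual config; you would feed the same fields to boto3's `register_task_definition`, or express them in TF/Pulumi):

```python
# Illustrative task definition for running the crawler on Fargate with an EFS
# volume mounted at /crawls. All IDs/names/sizes below are placeholders.
task_definition = {
    "family": "browsertrix-crawler",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "1024",
    "memory": "4096",
    "volumes": [
        {
            "name": "crawls",
            # Placeholder EFS filesystem ID
            "efsVolumeConfiguration": {"fileSystemId": "fs-XXXXXXXX"},
        }
    ],
    "containerDefinitions": [
        {
            "name": "crawler",
            "image": "webrecorder/browsertrix-crawler:latest",
            "command": ["crawl", "--url", "https://example.com/", "--generateWACZ"],
            # Mount the EFS volume where the crawler writes its output
            "mountPoints": [
                {"sourceVolume": "crawls", "containerPath": "/crawls"}
            ],
        }
    ],
}
```

The EFS mount is what keeps crawl output (WARCs/WACZ) around after the Fargate task exits.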