make browsertrix-crawler runnable in serverless environments

msramalho commented 11 months ago

Hi all,

I've been experimenting with making an AWS lambda function for browsertrix-crawler and I've gone some distance but hit a snag that the maintainers are probably better equipped to help with.

The problem is: AWS lambda function environment (I'm guessing other serverless options are similar) runs in a controlled environment where the only write permission to the /tmp directory and no other. For browsertrix-crawler outputs the --cwd option should solve it but it's still trying to write to .local (maybe that's playwright/redis or some other dependency?).

So the current issue error I get is:

mkdir: cannot create directory ‘/.local’: Read-only file system
touch: cannot touch '/.local/share/applications/mimeapps.list': No such file or directory
/usr/bin/google-chrome: line 45: /dev/fd/63: No such file or directory
/usr/bin/google-chrome: line 46: /dev/fd/63: No such file or directory
{
    "logLevel": "warn",
    "context": "redis",
    "message": "ioredis error",
    "details": {
        "error": "[ioredis] Unhandled error event:"
    }
}
{
    "logLevel": "warn",
    "context": "state",
    "message": "Waiting for redis at redis://localhost:6379/0",
    "details": {}
}
{
    "logLevel": "error",
    "context": "general",
    "message": "Crawl failed",
    "details": {
        "type": "exception",
        "message": "Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!",
        "stack": "TimeoutError: Timed out after 30000 ms while waiting for the WS endpoint URL to appear in stdout!\n    at ChromeLauncher.launch (file:///app/node_modules/puppeteer-core/lib/esm/puppeteer/node/ProductLauncher.js:123:23)\n    at async Browser._init (file:///app/util/browser.js:236:20)\n    at async Browser.launch (file:///app/util/browser.js:61:5)\n    at async Crawler.crawl (file:///app/crawler.js:821:5)\n    at async Crawler.run (file:///app/crawler.js:311:7)"
    }
}

and this is the version info

{
    "logLevel": "info",
    "context": "general",
    "message": "Browsertrix-Crawler 0.11.2 (with warcio.js 1.6.2 pywb 2.7.4)",
    "details": {}
}

I've put the Dockerfile and lambda_function.py in this gist you can use it if you want to replicate the issue.

For reference, I'm following these instructions: https://docs.aws.amazon.com/lambda/latest/dg/python-image.html And I'm using the API gateway to make testing quick:

tw4l commented 11 months ago

Thanks for flagging this!

mkdir: cannot create directory ‘/.local’: Read-only file system touch: cannot touch '/.local/share/applications/mimeapps.list': No such file or directory /usr/bin/google-chrome: line 45: /dev/fd/63: No such file or directory /usr/bin/google-chrome: line 46: /dev/fd/63: No such file or directory

Hm, I believe that these errors are from the browser itself, not necessarily Puppeteer. From some quick looking around, it looks like Chromium/Chrome/Brave may need to be built in a slightly different way to be able to run on AWS Lambda. We could probably accomplish this by having a separate browser base for Lambda, or perhaps the changes necessary could just be folded into the main release.

msramalho commented 11 months ago

Thanks, it makes sense it's chrome accessing those dirs.

In that case, a separate base would be the ideal scenario.

Is it possible that all the changes needed can be accommodate by chrome flags that we could already configure with CHROME_FLAGS as described in the README?

This (2yo) medium post points to a set of flags needed for chrome to run in lambda:

const chromeFlags = ['--no-xshm','--disable-dev-shm-usage','--single-process',
'--no-sandbox','--no-first-run',`--load-extension=${extensionDir}`]

// and then actually just 
'--no-first-run'

trying to gather whether it's worth testing that or if it has no future.

tw4l commented 11 months ago

Is it possible that all the changes needed can be accommodate by chrome flags that we could already configure with CHROME_FLAGS as described in the README?

It is possible! Tbh I'd have to dig deeper into it myself to say either way. It's also worth noting that current releases of the crawler are built on Brave Browser (see #189 for rationale), though it's still possible to build the crawler on Chrome/Chromium via the older debs in the https://github.com/webrecorder/browsertrix-browser-base repo.

If you're willing to put some time into investigating this I'd be happy to help/review a PR!

kema-dev commented 7 months ago

Hello @msramalho, have you been able to run in Lambda ? I'm considering a similar setup

msramalho commented 7 months ago

Hey @kema-dev, no updates from my side but still eager to see how this progresses. Several changes have been made to the project since and I wonder if any (changes to the browser base) can make this issue easier to solve.

kema-dev commented 7 months ago

Hey, I tried a bit and didn't achieve a reasonable result. I switched to use ECS + Fargate + EFS, got not problem with this method

msramalho commented 7 months ago

Cool! Care to share any configurations or tips for replication?

kema-dev commented 7 months ago

Sure !

ECS Cluster
Fargate Capacity Provider
VPC
- 3 Public Subnets
- 3 Private Subnets
EFS
- Volume for profiles
- Mount targets in each private subnet of VPC
Security Groups
- Browsertrix Profile creation
- Ingress
  - tcp/6080
  - tcp/9223
- Egress
  - tcp/all
- Browsertrix Crawling
- Egress
  - tcp/all
- EFS
- Profile creation
  - Ingress
  - tcp/2049 from Browsertrix Profile creation SG
  - Egress
  - tcp/all
- Crawling
  - Ingress
  - tcp/2049 from Browsertrix Crawling SG
  - Egress
  - tcp/all
S3 bucket for crawling results

IAM

Browsertrix crawling role

S3 bucket access

"Statement": [
{
"Effect": "Allow",
"Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
"Resource": "<s3 bucket arn>/*",
},
{
"Effect": "Allow",
"Action": ["s3:ListBucket", "s3:GetBucketLocation"],
"Resource": "<s3 bucket arn>",
},
],

IAM User access to the Role (as I did not achieve to make it work with the ECS Task Role to be assumed by the ECS Task)
Access key for the user

Secrets Manager for Access Key Id and Access Key Secret

ECS Task role (actually 2 roles, one for Browsertrix Profile creation and one for Browsertrix Crawling)

Condition

"Statement": [
{
  "Effect": "Allow",
  "Principal": {
    "Service": ["ecs-tasks.amazonaws.com"],
  },
  "Action": "sts:AssumeRole",
  "Condition": {
    "ArnLike": {
      "aws:SourceArn": "arn:aws:ecs:<awsRegion>:<awsAccountId>:*",
    },
    "StringEquals": {
      "aws:SourceAccount": awsAccountId,
    },
  },
},
],

Policy (for Browsertrix Crawling, remove Secrets Manager access for Browsertrix Profile creation Role)

"Statement": [
{
"Effect": "Allow",
"Action": [
  "ecr:GetAuthorizationToken",
  "ecr:BatchCheckLayerAvailability",
  "ecr:GetDownloadUrlForLayer",
  "ecr:BatchGetImage",
  "logs:CreateLogStream",
  "logs:PutLogEvents",
  "logs:CreateLogGroup",
],
"Resource": "*", // Needs further restriction, suitable for development only
},
{
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": <keyId>,
},
{
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": <keySecret>,
},
],

ECS Task Role (for ECS Exec)

Condition

"Statement": [
{
"Effect": "Allow",
"Principal": {
  "Service": ["ecs-tasks.amazonaws.com"],
},
"Action": "sts:AssumeRole",
"Condition": {
  "ArnLike": {
    "aws:SourceArn": "arn:aws:ecs:<awsRegion>:<awsAccountId>:*",
  },
  "StringEquals": {
    "aws:SourceAccount": awsAccountId,
  },
},
},
],

Policy (for ECS Exec)

"Statement": [
{
"Effect": "Allow",
"Action": [
  "ssmmessages:CreateControlChannel",
  "ssmmessages:CreateDataChannel",
  "ssmmessages:OpenControlChannel",
  "ssmmessages:OpenDataChannel",
],
"Resource": "*",
},
],

ECS Task

"container": {
"name": "<as you wish>",
"memory": 2048,
"cpu": 1024,
"entryPoint": ["<as you wish>"],
"command": [
"<as>",
"<you>",
"<wish>",
],
"environment": [
{
  "name": "STORE_ENDPOINT_URL",
  "value": "<s3 url>"
},
{
  "name": "STORE_FILENAME",
  "value": "<as you wish>",
},
{
  "name": "STORE_PATH",
  "value": "<as you wish>",
},
],
"secrets": [
{
  "name": "STORE_ACCESS_KEY",
  "valueFrom": "<iam access key arn>",
},
{
  "name": "STORE_SECRET_KEY",
  "valueFrom": "<iam secret key arn>",
},
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
  "awslogs-create-group"":" "true",
  "awslogs-group"":" "<as you wish>",
  "awslogs-region"":" "<awsRegion>",
  "awslogs-stream-prefix"":" "ecs",
},
},
"mountPoints": [
{
  "containerPath": "/crawls/profiles",
  "sourceVolume": "<EFS volume name>",
  "readOnly": false,
},
],
},
"volumes": [
{
"name": "<EFS volume name>",
"efsVolumeConfiguration": {
  "fileSystemId": "<EFS file system id>",
  "transitEncryption": "ENABLED",
},
},
],
"runtimePlatform": {
"operatingSystemFamily": "LINUX",
"cpuArchitecture": "ARM64", // FinOps
},
"skipDestroy": false,
"executionRole": {
"roleArn": "<ecsTaskRoleArn>",
},
"taskRole": {
"roleArn": "<ecsTaskRoleArn>",
},
"logGroup": {
"args": {
"name": "<as you wish>",
"retentionInDays": <as you wish>,
"tags": {
  <as you wish>
},
},
},
"tags": {
<as you wish>
},

ikreymer commented 7 months ago

@kema-dev Thanks for sharing this! If there's a format that would make the most specify to specify this in (Terraform? Ansible playbook) or just as docs, happy to integrate this into the repo and/or our docs!

kema-dev commented 7 months ago

I personally use Pulumi, but it uses TF providers as backends anyway. Thoses resources are just AWS services that need to be provisioned, using Console, Ansible, TF, or Pulumi goes the same way.

I'm designing a complete solution with Event Bridge as scheduler and the ECS stuff I described above. Anyway, the core of the solution resides in my precedent message !

webrecorder / browsertrix-crawler

make browsertrix-crawler runnable in serverless environments #448