msramalho opened 11 months ago
Thanks for flagging this!
```
mkdir: cannot create directory ‘/.local’: Read-only file system
touch: cannot touch '/.local/share/applications/mimeapps.list': No such file or directory
/usr/bin/google-chrome: line 45: /dev/fd/63: No such file or directory
/usr/bin/google-chrome: line 46: /dev/fd/63: No such file or directory
```
Hm, I believe that these errors are from the browser itself, not necessarily Puppeteer. From some quick looking around, it looks like Chromium/Chrome/Brave may need to be built in a slightly different way to be able to run on AWS Lambda. We could probably accomplish this by having a separate browser base for Lambda, or perhaps the changes necessary could just be folded into the main release.
Thanks, it makes sense that it's Chrome accessing those dirs.
In that case, a separate base would be the ideal scenario.
Is it possible that all the changes needed could be accommodated by Chrome flags, which we can already configure with `CHROME_FLAGS` as described in the README?
This (2-year-old) Medium post points to a set of flags needed for Chrome to run in Lambda:
```js
const chromeFlags = ['--no-xshm', '--disable-dev-shm-usage', '--single-process',
  '--no-sandbox', '--no-first-run', `--load-extension=${extensionDir}`]
// and then actually just
'--no-first-run'
```
I'm trying to gather whether it's worth testing that, or if it has no future.
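If it's worth a shot, this is roughly what I'd try (an untested sketch; I'm assuming `CHROME_FLAGS` takes a space-separated list, per the README):

```bash
# Untested sketch: forward the Medium post's flag set to the browser via the
# CHROME_FLAGS env var mentioned in the README. The space-separated format
# is an assumption; check the README for the exact syntax.
docker run -e CHROME_FLAGS="--no-xshm --disable-dev-shm-usage --single-process --no-sandbox --no-first-run" \
  webrecorder/browsertrix-crawler crawl --url https://example.com/ --generateWACZ
```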
> Is it possible that all the changes needed could be accommodated by Chrome flags, which we can already configure with `CHROME_FLAGS` as described in the README?
It is possible! Tbh I'd have to dig deeper into it myself to say either way. It's also worth noting that current releases of the crawler are built on Brave Browser (see #189 for rationale), though it's still possible to build the crawler on Chrome/Chromium via the older debs in the https://github.com/webrecorder/browsertrix-browser-base repo.
If you're willing to put some time into investigating this I'd be happy to help/review a PR!
Hello @msramalho, have you been able to run this in Lambda? I'm considering a similar setup.
Hey @kema-dev, no updates from my side but still eager to see how this progresses. Several changes have been made to the project since then, and I wonder if any (e.g. the changes to the browser base) make this issue easier to solve.
Hey, I tried a bit but didn't achieve a reasonable result. I switched to ECS + Fargate + EFS and had no problems with that approach.
Cool! Care to share any configurations or tips for replication?
Sure!
"Statement": [
{
"Effect": "Allow",
"Action": ["s3:PutObject", "s3:GetObject", "s3:ListBucket"],
"Resource": "<s3 bucket arn>/*",
},
{
"Effect": "Allow",
"Action": ["s3:ListBucket", "s3:GetBucketLocation"],
"Resource": "<s3 bucket arn>",
},
],
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": ["ecs-tasks.amazonaws.com"],
},
"Action": "sts:AssumeRole",
"Condition": {
"ArnLike": {
"aws:SourceArn": "arn:aws:ecs:<awsRegion>:<awsAccountId>:*",
},
"StringEquals": {
"aws:SourceAccount": awsAccountId,
},
},
},
],
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"logs:CreateLogStream",
"logs:PutLogEvents",
"logs:CreateLogGroup",
],
"Resource": "*", // Needs further restriction, suitable for development only
},
{
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": <keyId>,
},
{
"Effect": "Allow",
"Action": ["secretsmanager:GetSecretValue"],
"Resource": <keySecret>,
},
],
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": ["ecs-tasks.amazonaws.com"],
},
"Action": "sts:AssumeRole",
"Condition": {
"ArnLike": {
"aws:SourceArn": "arn:aws:ecs:<awsRegion>:<awsAccountId>:*",
},
"StringEquals": {
"aws:SourceAccount": awsAccountId,
},
},
},
],
"Statement": [
{
"Effect": "Allow",
"Action": [
"ssmmessages:CreateControlChannel",
"ssmmessages:CreateDataChannel",
"ssmmessages:OpenControlChannel",
"ssmmessages:OpenDataChannel",
],
"Resource": "*",
},
],
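Those `ssmmessages` permissions are what ECS Exec uses; with exec enabled on the task you can shell into the running crawler for debugging, e.g.:

```bash
# Assumes ECS Exec is enabled on the task (e.g. run-task/create-service with
# --enable-execute-command); names are placeholders.
aws ecs execute-command \
  --cluster <cluster name> \
  --task <task id> \
  --container <container name> \
  --interactive \
  --command "/bin/sh"
```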
"container": {
"name": "<as you wish>",
"memory": 2048,
"cpu": 1024,
"entryPoint": ["<as you wish>"],
"command": [
"<as>",
"<you>",
"<wish>",
],
"environment": [
{
"name": "STORE_ENDPOINT_URL",
"value": "<s3 url>"
},
{
"name": "STORE_FILENAME",
"value": "<as you wish>",
},
{
"name": "STORE_PATH",
"value": "<as you wish>",
},
],
"secrets": [
{
"name": "STORE_ACCESS_KEY",
"valueFrom": "<iam access key arn>",
},
{
"name": "STORE_SECRET_KEY",
"valueFrom": "<iam secret key arn>",
},
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-create-group"":" "true",
"awslogs-group"":" "<as you wish>",
"awslogs-region"":" "<awsRegion>",
"awslogs-stream-prefix"":" "ecs",
},
},
"mountPoints": [
{
"containerPath": "/crawls/profiles",
"sourceVolume": "<EFS volume name>",
"readOnly": false,
},
],
},
"volumes": [
{
"name": "<EFS volume name>",
"efsVolumeConfiguration": {
"fileSystemId": "<EFS file system id>",
"transitEncryption": "ENABLED",
},
},
],
"runtimePlatform": {
"operatingSystemFamily": "LINUX",
"cpuArchitecture": "ARM64", // FinOps
},
"skipDestroy": false,
"executionRole": {
"roleArn": "<ecsTaskRoleArn>",
},
"taskRole": {
"roleArn": "<ecsTaskRoleArn>",
},
"logGroup": {
"args": {
"name": "<as you wish>",
"retentionInDays": <as you wish>,
"tags": {
<as you wish>
},
},
},
"tags": {
<as you wish>
},
@kema-dev Thanks for sharing this! If there's a format that would make the most sense to specify this in (Terraform? Ansible playbook?) or just as docs, I'd be happy to integrate this into the repo and/or our docs!
I personally use Pulumi, but it uses Terraform providers as backends anyway. Those resources are just AWS services that need to be provisioned; using the Console, Ansible, Terraform, or Pulumi works the same way.
I'm designing a complete solution with EventBridge as the scheduler plus the ECS setup I described above. Anyway, the core of the solution is in my previous message!
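To give an idea of the scheduling piece, a rough sketch with the AWS CLI (names, ARNs, and network settings are all placeholders; I use Pulumi rather than the CLI, so treat this as illustrative):

```bash
# Sketch: an EventBridge rule that launches the Fargate task on a schedule.
# The RoleArn must be a role EventBridge can assume, with permission to run
# the task (ecs:RunTask plus iam:PassRole on the task's roles).
aws events put-rule --name browsertrix-crawl-schedule --schedule-expression "rate(1 day)"
aws events put-targets --rule browsertrix-crawl-schedule --targets '[{
  "Id": "browsertrix-crawl",
  "Arn": "arn:aws:ecs:<awsRegion>:<awsAccountId>:cluster/<cluster name>",
  "RoleArn": "<eventbridge invocation role arn>",
  "EcsParameters": {
    "TaskDefinitionArn": "arn:aws:ecs:<awsRegion>:<awsAccountId>:task-definition/<task family>",
    "LaunchType": "FARGATE",
    "NetworkConfiguration": {
      "awsvpcConfiguration": {"Subnets": ["<subnet id>"], "AssignPublicIp": "ENABLED"}
    }
  }
}]'
```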
Hi all,
I've been experimenting with making an AWS Lambda function for browsertrix-crawler and I've gone some distance, but I've hit a snag that the maintainers are probably better equipped to help with.
The problem is: the AWS Lambda environment (I'm guessing other serverless options are similar) only grants write permission to the `/tmp` directory and nowhere else. For browsertrix-crawler's outputs the `--cwd` option should solve it, but something is still trying to write to `.local` (maybe that's playwright/redis or some other dependency?). So the current error I get is:

and this is the version info
I've put the `Dockerfile` and `lambda_function.py` in this gist; you can use it if you want to replicate the issue.

For reference, I'm following these instructions: https://docs.aws.amazon.com/lambda/latest/dg/python-image.html, and I'm using API Gateway to make testing quick:
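One idea I haven't verified: the `/.local` writes look like Chrome following the `$HOME`/XDG conventions, so pointing those at Lambda's writable `/tmp` before the crawler starts might clear the first two errors. A sketch (the env var names are standard XDG; whether this is sufficient on Lambda is untested):

```bash
# Untested sketch: redirect home/XDG writes to Lambda's only writable path,
# e.g. at the top of the container's entrypoint, before launching the crawler.
export HOME=/tmp
export XDG_DATA_HOME=/tmp/.local/share
export XDG_CACHE_HOME=/tmp/.cache
mkdir -p "$XDG_DATA_HOME/applications" "$XDG_CACHE_HOME"
```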