webrecorder / browsertrix

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!
https://browsertrix.com
GNU Affero General Public License v3.0

Supporting OpenShift #1090

Open ikreymer opened 1 year ago

ikreymer commented 1 year ago

Ideally, we'd have a way of running Browsertrix Cloud on the OpenShift flavor of K8s, but that'll likely require a few changes. Currently, we don't have the capacity to do this just yet, but would like to eventually support OpenShift. This is a placeholder issue to keep track of this.

We can start by listing known changes / requirements that will need to be made to support OpenShift in this issue.

wvengen commented 1 year ago

Actually, we're doing this already, patching the charts in the following ways:

  • avoid putting secrets in YAML config (with instructions to manually create the secrets)
  • comment out the clusterIP: None in the mongo template
  • proxy S3 storage with nginx (Swift doesn't support everything necessary for replay; see below)
  • hosting our own ReplayWeb.page with injection customizations

The S3 storage proxy is not ideal, that could use some improvement. The issue was that our OpenStack Swift S3 does not allow HEAD requests on signed objects for CORS, which is required for ReplayWeb.page. Maybe this is not needed when using your own ReplayWeb.page hosted on the same domain (as we do with ingress rules), since then CORS shouldn't matter - but I did not verify this.
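For anyone hitting the same CORS limitation, the same-origin proxy idea can be sketched roughly like this (not our exact config; the location path, upstream host, and bucket name are placeholders):

```nginx
# Serve archive objects from the app's own origin so the browser
# never makes a cross-origin request to Swift/S3 at all.
location /s3-proxy/ {
    # hypothetical Swift S3 endpoint and bucket
    proxy_pass https://swift.example.org/btrix-data/;
    proxy_http_version 1.1;
    proxy_set_header Host swift.example.org;
    # pass Range requests through so replay can fetch partial content
    proxy_set_header Range $http_range;
    proxy_hide_header x-amz-request-id;
}
```

Since the objects are then served from the same domain as ReplayWeb.page, no CORS preflight or headers are involved at all.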

ikreymer commented 1 year ago

@wvengen That's great to hear; I'm surprised by the list below.

Would you mind opening a PR / sharing what you've done exactly so we can integrate into the main codebase?

One question we had was about namespaces - my understanding is that OpenShift has more constraints on namespace creation - are you using an existing namespace instead of crawlers?

> Actually, we're doing this already, patching the charts in the following ways:

> • avoid putting secrets in YAML config (with instructions to manually create the secrets)

I commented more on #490 -- is that something that OpenShift requires, or a decision you've made? At first glance, OpenShift Secrets don't seem to behave differently from standard k8s Secrets.
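For reference, pre-creating a secret by hand instead of templating it from chart values could look like this (a sketch; the secret name and keys here are assumptions, not necessarily what the chart expects):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: btrix-auth          # hypothetical name the chart would reference
  namespace: default
type: Opaque
stringData:                  # stringData avoids manual base64-encoding
  superuser-email: admin@example.org
  superuser-password: change-me
```

The chart would then need to reference the existing Secret by name rather than rendering one from values.yaml.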

> • comment out the clusterIP: None in the mongo template

This is just to run the service as a headless service for the statefulset - a common pattern. Why was this needed?

> • proxy S3 storage with nginx (Swift doesn't support everything necessary for replay; see below)
> • hosting our own ReplayWeb.page with injection customizations

> The S3 storage proxy is not ideal, that could use some improvement. The issue was that our OpenStack Swift S3 does not allow HEAD requests on signed objects for CORS, which is required for ReplayWeb.page. Maybe this is not needed when using your own ReplayWeb.page hosted on the same domain (as we do with ingress rules), since then CORS shouldn't matter - but I did not verify this.

ReplayWeb.page will fall back on GET requests if HEAD fails; it prefers HEAD to check the size, so this shouldn't be an issue either way.

wvengen commented 1 year ago

> are you using an existing namespace instead of crawlers

No, we use crawlers. I don't remember whether we needed to create the namespace manually or not.
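In case it helps others: if the crawlers namespace does have to be created by hand, the manifest is minimal:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: crawlers
```

On OpenShift the interesting part is less the namespace itself than whether the crawler pods' security context is allowed by the cluster's security context constraints, which may need adjusting by an admin.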

> avoid putting secrets in YAML config (with instructions to manually create the secrets)

> I commented more on https://github.com/webrecorder/browsertrix-cloud/issues/490 -- is that something that OpenShift requires or a decision you've made?

No, that was a decision we made. Good to see this discussion in #490.

> comment out the clusterIP: None in the mongo template

> This is just to run the service as a headless service for the statefulset - a common pattern. Why was this needed?

Good question; I would need to dive into this again to figure it out. Reading about headless services, it looks like removing clusterIP: None isn't actually necessary.
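For context, the headless-service pattern being discussed is just a Service with clusterIP: None selecting the StatefulSet's pods, which gives each pod a stable per-pod DNS name instead of a single load-balanced virtual IP. A sketch, with assumed names and labels:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: local-mongo
spec:
  clusterIP: None        # headless: DNS resolves directly to the pod IPs
  selector:
    app: local-mongo     # assumed label; must match the StatefulSet's pod template
  ports:
    - port: 27017
      name: mongo
```

The StatefulSet references this service via its serviceName field, so removing clusterIP: None changes DNS behavior but should not be an OpenShift requirement either way.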

> ReplayWeb.page will fall back on GET requests if HEAD fails; it prefers HEAD to check the size, so this shouldn't be an issue either way.

Is this a GET request with a Range header to determine the size? For many archives, getting the whole archive is just too much. I'd need to check if this suffers from the same issue or not.
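To illustrate the mechanism being asked about (this is not Browsertrix's actual code): a GET with `Range: bytes=0-0` returns a single byte plus a Content-Range header whose total after the slash is the full object size, so no full download is needed. Parsing that total is trivial:

```python
def size_from_content_range(content_range: str) -> int:
    """Extract the total size from a Content-Range header value.

    A server answering GET with `Range: bytes=0-0` responds
    `206 Partial Content` with e.g. `Content-Range: bytes 0-0/1879048192`;
    the number after the slash is the full object size.
    """
    unit, _, rest = content_range.partition(" ")
    if unit != "bytes":
        raise ValueError(f"unsupported range unit: {unit!r}")
    _, _, total = rest.partition("/")
    if total == "*":
        raise ValueError("server did not report a total size")
    return int(total)


print(size_from_content_range("bytes 0-0/1879048192"))  # 1879048192
```

Whether Swift honors Range on signed URLs in a cross-origin context is a separate question, since CORS preflight can still block the request before the Range header matters.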

Thanks for asking these questions to get more clarity on what is really necessary.

wvengen commented 7 months ago

I got round to reinstalling Browsertrix from scratch on OpenStack. Most things work as-is; what does not work in our case is replaying archives.

Our infrastructure provider offers OpenStack's Swift for object storage (which has S3 support). We could not get CORS to work here (neither HEAD nor GET, with either signed or public URLs), so we need to keep using a storage proxy - a bit of a hack, but it works.

Swift does support CORS to some extent, but not in our case, unfortunately, so I cannot say if this holds for all OpenStack users.

hy-tomas-terala commented 26 minutes ago

Adding on to this: I am an OpenShift admin with a user who wanted to use Browsertrix. I've got everything but the frontend and backend working on OpenShift 4.15. It seems to me there should be some easy progress here. I am willing to send more logs if necessary, and also to edit the Helm install to change the image to a debug one.

OpenShift: 4.15.X
Helm version: 1.11.7
Helm values: everything default

frontend in a crashloop:

Nginx keeps restarting

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/00-browsertrix-nginx-init.sh
rm: cannot remove '/etc/nginx/conf.d/default.conf': Permission denied
local minio: replacing $LOCAL_MINIO_HOST with "local-minio.default:9000", $LOCAL_BUCKET with "btrix-data"
sed: couldn't open temporary file /etc/nginx/includes/sedTfRzBf: Permission denied
sed: couldn't open temporary file /etc/nginx/includes/sedB76DAl: Permission denied
mkdir: cannot create directory '/etc/nginx/resolvers/': Permission denied
/docker-entrypoint.d/00-browsertrix-nginx-init.sh: line 16: /etc/nginx/resolvers/resolvers.conf: No such file or directory
cat: /etc/nginx/resolvers/resolvers.conf: No such file or directory

This docs page says that adding this line in the Dockerfile should fix most issues:

RUN chgrp -R 0 /some/directory && chmod -R g=u /some/directory
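Applied to the paths failing in the log above, a patched frontend image might look like this (a sketch; the base image name and tag are assumptions - adjust to the release you deploy). OpenShift runs containers under an arbitrary non-root UID that is always a member of group 0, so making these directories group-owned by 0 and group-writable lets the entrypoint's rm/sed/mkdir calls succeed:

```dockerfile
FROM docker.io/webrecorder/browsertrix-frontend:latest

# Grant group 0 the same permissions the image's build user had,
# covering /etc/nginx/conf.d, /etc/nginx/includes, and the
# /etc/nginx/resolvers dir the entrypoint tries to create.
RUN mkdir -p /etc/nginx/resolvers && \
    chgrp -R 0 /etc/nginx /usr/share/nginx/html && \
    chmod -R g=u /etc/nginx /usr/share/nginx/html
```

Alternatively, granting the deployment's service account an SCC that allows the image's own UID would avoid rebuilding the image, at the cost of weakening the namespace's security posture.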

backend

op container

The op container seems to be fine

[2024-09-20 07:47:55 +0000] [1] [INFO] Starting gunicorn 23.0.0
[2024-09-20 07:47:55 +0000] [1] [INFO] Listening at: http://0.0.0.0:8756 (1)
[2024-09-20 07:47:55 +0000] [1] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2024-09-20 07:47:55 +0000] [7] [INFO] Booting worker with pid: 7
[2024-09-20 07:47:59 +0000] [7] [INFO] Started server process [7]
[2024-09-20 07:47:59 +0000] [7] [INFO] Waiting for application startup.
[2024-09-20 07:47:59 +0000] [7] [INFO] Application startup complete.
crawler resources
cpu = 0.900 + 1 * 0.600 = 1.5
qa_cpu = 0.900 + 0 * 0.600 = 0.9
memory = 1073741824 + 1 * 805306368 = 1879048192
qa_memory = 1073741824 + 0 * 805306368 = 1073741824
max crawler memory size: 1879048192
profile browser resources
cpu = 0.900
memory = 1073741824
Pod Metrics Available: True
Auto-Resize Enabled False
10.12.208.2:46286 - "GET /healthz HTTP/1.1" 200
10.12.208.2:46286 - "GET /healthz HTTP/1.1" 200
10.12.208.2:46286 - "GET /healthz HTTP/1.1" 200
...

api-container

database stuck? The api-container keeps restarting.

[2024-09-20 08:59:23 +0000] [1] [INFO] Starting gunicorn 23.0.0
[2024-09-20 08:59:23 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2024-09-20 08:59:23 +0000] [1] [INFO] Using worker: uvicorn.workers.UvicornWorker
[2024-09-20 08:59:23 +0000] [7] [INFO] Booting worker with pid: 7
[2024-09-20 08:59:26 +0000] [7] [INFO] Started server process [7]
[2024-09-20 08:59:26 +0000] [7] [INFO] Waiting for application startup.
Waiting DB
[2024-09-20 08:59:28 +0000] [7] [INFO] Application startup complete.
10.12.208.2:48146 - "GET /healthzStartup HTTP/1.1" 503
10.12.208.2:41102 - "GET /healthzStartup HTTP/1.1" 503
....
Retrying, waiting for DB to be ready
10.12.208.2:41218 - "GET /healthzStartup HTTP/1.1" 503
...

Let me know if there is something else I can provide that helps with this.