zalando / postgres-operator

Postgres operator creates and manages PostgreSQL clusters running in Kubernetes
https://postgres-operator.readthedocs.io/
MIT License

Postgres Operator fails to start on Minikube 1.26.0 with qemu2 driver on ARM64 #1940

Closed mprimeaux closed 1 year ago

mprimeaux commented 2 years ago

I'm using Minikube 1.26.0 with the qemu2 driver on Apple M1 silicon and the operator fails with the following error:

postgres-operator exec /postgres-operator: exec format error

Using the latest PostgreSQL operator (1.8.2) works as expected on this same version of Minikube on Apple M1 silicon using the docker driver.

Unfortunately, the pod immediately terminates so I've been unable to gather any log files. Does the postgres-operator support ARM64?
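
An `exec format error` generally means the kernel was asked to run a binary built for a different CPU architecture, so a first check is to compare the node architecture with the platforms the image was published for. A rough sketch, not an official diagnostic: `to_docker_arch` is a hypothetical helper, and the commented commands assume `kubectl` and `docker` are installed.

```shell
#!/bin/sh
# Hypothetical helper: map `uname -m` style machine names to the
# architecture names used in container image platforms (linux/<arch>).
to_docker_arch() {
    case "$1" in
        x86_64|amd64)  echo "amd64" ;;
        aarch64|arm64) echo "arm64" ;;
        *)             echo "$1" ;;   # pass unknown names through unchanged
    esac
}

# Architecture of the machine this script runs on.
echo "machine architecture: $(to_docker_arch "$(uname -m)")"

# Not run here; these assume kubectl and docker are available:
#   kubectl get nodes -o jsonpath='{.items[*].status.nodeInfo.architecture}'
#   docker manifest inspect <image>   # lists the platforms an image was published for
```

If the node reports `arm64` and the image only lists `amd64`, the binary cannot run unless an emulation layer is present.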

weisdd commented 2 years ago

Seems like official builds are only for amd64: https://github.com/zalando/postgres-operator/blob/1c80ac0acd4fb15432e46d8dadac6f1bf4817d31/Makefile#L57-L59

mprimeaux commented 2 years ago

@weisdd Thanks for the link.

FWIW, this operator works as expected on my Apple M1 using Docker Desktop (Apple Silicon) as my driver for Minikube but does not work on the same machine with the only difference being the qemu2 driver for Minikube.

I'll dig into it a bit more but perhaps Rosetta2 is running it as AMD64 even though in an ARM64 VM.

mmoscher commented 1 year ago

@mprimeaux building the operator on an aarch64 (linux/arm64) machine (Google Cloud Tau T2A GCE Instance) worked out for me, i.e. customizing the Makefile+Dockerfile and overriding the operator's default image (helm chart values). Additionally, one has to use the custom, arm64 compatible, spilo image which is already available in the Zalando registry. Will test this tomorrow/next week on an Apple Silicon M1 processor. If the PoC works well-enough, I'll file a pull request.

mprimeaux commented 1 year ago

@mmoscher Thanks much! Please let me know if I can test the PR. Happy to help.

mprimeaux commented 1 year ago

@mmoscher Any updates on the arm64 support? Please let me know how (or if) I can help. I'll make time.

mmoscher commented 1 year ago

@mprimeaux spilo linux/arm64 support was merged yesterday (https://github.com/zalando/spilo/pull/790) and will be available with the next spilo tag (PostgreSQL version >= 14 only).

Now we can continue with the operator itself to make it linux/arm64 compatible. However, its base image, registry.opensource.zalan.do/library/alpine-3.xx, is not yet available for the linux/arm64 architecture in the Zalando registry.

As mentioned in #2084, two options are feasible. The second option, e.g. hosting on ghcr.io, would be my favorite. Nevertheless, I have not had time to implement it yet. Maybe I'll have some free time at the end of the week or on the weekend.

For now, you can build it yourself with some small changes: https://github.com/mmoscher/postgres-operator/pull/1/files

TL;DR: I'm still on it ;)

joepa37 commented 1 year ago

I have tried building postgres-operator making those changes

make deps
export TAG=$(git describe --tags --always --dirty)
make docker

But I am getting some errors

at make deps

GO111MODULE=on go mod tidy
github.com/zalando/postgres-operator/pkg/cluster imports
    k8s.io/client-go/rest imports
    k8s.io/client-go/plugin/pkg/client/auth/exec imports
    io/fs: malformed module path "io/fs": missing dot in first path element
make: *** [Makefile:90: tools] Error 1

at make docker

go: extracting github.com/emicklei/go-restful v2.9.5+incompatible
github.com/zalando/postgres-operator/pkg/cluster imports
    k8s.io/client-go/rest imports
    k8s.io/client-go/plugin/pkg/client/auth/exec imports
    io/fs: malformed module path "io/fs": missing dot in first path element
make: *** [Makefile:90: tools] Error 1
echo '{\n "url": "git:https://github.com/zalando/postgres-operator.git",\n "revision": "c895e8f6",\n "author": "root",\n "status": " M Makefile  M docker/Dockerfile  M go.mod"\n}' > scm-source.json
GOOS=linux GOARCH=arm64 CGO_ENABLED=0 go build -o build/linux/postgres-operator -v -ldflags "-X=main.version=v1.8.2-34-gc895e8f6-dirty" cmd/main.go
build command-line-arguments: cannot load io/fs: malformed module path "io/fs": missing dot in first path element
make: *** [Makefile:58: linux] Error 1
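
The `io/fs` package joined the Go standard library in Go 1.16, so an older toolchain treats the import as a module path and rejects it. That would explain this message independently of the target architecture. A minimal sketch for checking the local toolchain, assuming `go` is on the PATH; `version_ge` is a hypothetical helper, not part of the operator's Makefile.

```shell
#!/bin/sh
# Hypothetical helper: succeed when version $1 >= version $2,
# comparing major.minor only (patch level and suffixes are ignored).
version_ge() {
    major1=${1%%.*}; rest1=${1#*.}; minor1=${rest1%%.*}; minor1=${minor1%%[!0-9]*}
    major2=${2%%.*}; rest2=${2#*.}; minor2=${rest2%%.*}; minor2=${minor2%%[!0-9]*}
    [ "$major1" -gt "$major2" ] ||
        { [ "$major1" -eq "$major2" ] && [ "$minor1" -ge "$minor2" ]; }
}

if command -v go >/dev/null 2>&1; then
    # The third field of `go version` is the release, prefixed with "go".
    current=$(go version | awk '{print $3}' | sed 's/^go//')
    if version_ge "$current" "1.16"; then
        echo "Go $current includes io/fs"
    else
        echo "Go $current predates io/fs (needs >= 1.16)"
    fi
fi
```

Building inside a recent `golang` base image, as the Dockerfile-based builds later in this thread do, sidesteps the host toolchain entirely.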

jonizen commented 1 year ago

I tried it out and could build and push the image without any problems. :) link to image

@joepa37 see if that works :)

joepa37 commented 1 year ago

@jonizen I can see your image is linux/amd64 OS/ARCH. So maybe these issues are related to go dependencies on linux/arm64 arch only.

jonizen commented 1 year ago

@joepa37 In my experience, arm64 builds usually don't work on Windows. I did a project compiling Percona XtraBackup for arm64, and the only way to get it working without a lot of tweaks was to use WSL2 and run Ubuntu on my Windows machine to build for arm64 with BuildKit. This image was built on my laptop running Ubuntu. So the code works, but I guess you may be running BuildKit on a Windows computer?

abangser commented 1 year ago

@joepa37 I had a similar problem building on my M1 and similarly ran into the situation where @jonizen's image was still the AMD arch.

My fix was to make 2 small changes to the code from @mmoscher (thank you!) to do a docker buildx command and build both arches.

  1. My new make target code was:

      docker: ${DOCKERDIR}/${DOCKERFILE} docker-context
          echo `(env)`
          echo "Tag ${TAG}"
          echo "Version ${VERSION}"
          echo "CDP tag ${CDP_TAG}"
          echo "git describe $(shell git describe --tags --always --dirty)"
          if ! docker buildx ls | grep -q "zalando-builder"; then \
              docker buildx create --name zalando-builder; \
          fi;
          cd "${DOCKERDIR}" && docker buildx build \
              --rm \
              --builder zalando-builder \
              --platform linux/arm64,linux/amd64 \
              --tag $(IMAGE):$(TAG)$(CDP_TAG)$(DEBUG_FRESH)$(DEBUG_POSTFIX) \
              --push \
              --file ${DOCKERFILE} \
              .
  2. I removed the hardcoded values of the two ARGs on lines 5 and 6 of the Dockerfile so that they are passed in instead.

These changes allowed me to run the following command:

IMAGE=my-repo/zalan-do-acid-postgres-operator make docker

but still got an error 😒

cd "docker" && docker buildx build \
    --rm \
    --builder zalando-builder \
    --platform linux/arm64,linux/amd64 \
    --tag syntasso/zalan-do-acid-postgres-operator:2880a58-dirty \
    --push \
    --file Dockerfile \
    .
[+] Building 16.3s (23/33)                                                                                                                                    
 => [internal] load build definition from Dockerfile                                                                                                     0.1s
 => => transferring dockerfile: 993B                                                                                                                     0.0s
 => [internal] load .dockerignore                                                                                                                        0.1s
 => => transferring context: 2B                                                                                                                          0.0s
 => [linux/arm64 internal] load metadata for docker.io/library/alpine:3.15                                                                               4.3s
 => [linux/amd64 internal] load metadata for docker.io/library/golang:1.17-alpine3.15                                                                    4.3s
 => [linux/amd64 internal] load metadata for docker.io/library/alpine:3.15                                                                               4.3s
 => [auth] library/alpine:pull token for registry-1.docker.io                                                                                            0.0s
 => [auth] library/golang:pull token for registry-1.docker.io                                                                                            0.0s
 => [linux/amd64 go-builder 1/8] FROM docker.io/library/golang:1.17-alpine3.15@sha256:543b0922baa147b87a568968462a9586e94b588426f51396a2666590cfba327a   0.2s
 => => resolve docker.io/library/golang:1.17-alpine3.15@sha256:543b0922baa147b87a568968462a9586e94b588426f51396a2666590cfba327a                          0.1s
 => [linux/amd64 postgres-operator 1/6] FROM docker.io/library/alpine:3.15@sha256:cf34c62ee8eb3fe8aa24c1fab45d7e9d12768d945c3f5a6fd6a63d901e898479       0.2s
 => => resolve docker.io/library/alpine:3.15@sha256:cf34c62ee8eb3fe8aa24c1fab45d7e9d12768d945c3f5a6fd6a63d901e898479                                     0.2s
 => [internal] load build context                                                                                                                       10.6s
 => => transferring context: 61.29MB                                                                                                                    10.4s
 => [linux/arm64 postgres-operator 1/6] FROM docker.io/library/alpine:3.15@sha256:cf34c62ee8eb3fe8aa24c1fab45d7e9d12768d945c3f5a6fd6a63d901e898479       0.2s
 => => resolve docker.io/library/alpine:3.15@sha256:cf34c62ee8eb3fe8aa24c1fab45d7e9d12768d945c3f5a6fd6a63d901e898479                                     0.2s
 => CACHED [linux/arm64 postgres-operator 2/6] RUN apk --no-cache add curl                                                                               0.0s
 => CACHED [linux/arm64 postgres-operator 3/6] RUN apk --no-cache add ca-certificates                                                                    0.0s
 => CACHED [linux/amd64 postgres-operator 2/6] RUN apk --no-cache add curl                                                                               0.0s
 => CACHED [linux/amd64 postgres-operator 3/6] RUN apk --no-cache add ca-certificates                                                                    0.0s
 => CACHED [linux/amd64 go-builder 2/8] WORKDIR /src                                                                                                     0.0s
 => CACHED [linux/amd64 go-builder 3/8] COPY . .                                                                                                         0.0s
 => CACHED [linux/amd64->arm64 go-builder 4/8] RUN go get -d k8s.io/client-go@kubernetes-1.22.4                                                          0.0s
 => CACHED [linux/amd64->arm64 go-builder 5/8] RUN go install github.com/golang/mock/mockgen@v1.6.0                                                      0.0s
 => ERROR [linux/amd64->arm64 go-builder 6/8] RUN go mod tidy                                                                                            1.1s
 => CACHED [linux/amd64 go-builder 4/8] RUN go get -d k8s.io/client-go@kubernetes-1.22.4                                                                 0.0s
 => CACHED [linux/amd64 go-builder 5/8] RUN go install github.com/golang/mock/mockgen@v1.6.0                                                             0.0s
 => ERROR [linux/amd64 go-builder 6/8] RUN go mod tidy                                                                                                   1.1s
------                                                                                                                                                        
 > [linux/amd64->arm64 go-builder 6/8] RUN go mod tidy:                                                                                                       
#0 0.850 go: go.mod file not found in current directory or any parent directory; see 'go help modules'
------
------
 > [linux/amd64 go-builder 6/8] RUN go mod tidy:
#0 0.897 go: go.mod file not found in current directory or any parent directory; see 'go help modules'
------
Dockerfile:15
--------------------
  13 |     RUN go get -d k8s.io/client-go@kubernetes-1.22.4
  14 |     RUN go install github.com/golang/mock/mockgen@v1.6.0
  15 | >>> RUN go mod tidy
  16 |     RUN go mod vendor
  17 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c go mod tidy" did not complete successfully: exit code: 1
make: *** [Makefile:74: docker] Error 1

This one feels like something other than ARCH (edit: this fails the same way on the branch with the docker changes, but not on the master branch, when building on an M1 Mac), but I may be missing how my changes affected it. I will keep poking, but if anyone has an idea please let me know! Thanks 🙇

jonizen commented 1 year ago

Yeah, I pushed the wrong image, but I got the arm64 one. So it works I believe 😊

Be aware that your output says “CACHED”: if a failing step got cached, the build can keep failing even when everything is now correct. Since this image builds fast, run with --no-cache to rule that out.

I will have a look at this probably today after work 😊

abangser commented 1 year ago

Thanks for the update @jonizen! 🙇

That is interesting that it did work for you. I ended up tracking down an issue where the Dockerfile has a COPY . . and that is paired with a line in the makefile that runs the docker command from the DOCKERDIR. This means that the only files available in the image are the files in the DOCKERDIR which is obviously not enough and doesn't include the go.mod.
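
The effect can be reproduced without Docker: `COPY . .` only sees files under the build context, which is the directory passed as the last argument to `docker buildx build`. A small sketch using a throwaway directory laid out like the repository (`go.mod` at the root, `docker/Dockerfile` below it); `context_has_go_mod` is a hypothetical helper for illustration only.

```shell
#!/bin/sh
# Hypothetical helper: succeed if go.mod sits at the root of the given
# build context, which is what `COPY . .` followed by `go mod tidy` needs.
context_has_go_mod() {
    [ -f "$1/go.mod" ]
}

# Throwaway layout mirroring the repository: go.mod at the root,
# the Dockerfile in a docker/ subdirectory.
repo=$(mktemp -d)
mkdir "$repo/docker"
touch "$repo/go.mod" "$repo/docker/Dockerfile"

# Compare the two candidate contexts: the docker/ directory and the repo root.
for context in docker .; do
    if context_has_go_mod "$repo/$context"; then
        echo "context '$context': go.mod present"
    else
        echo "context '$context': go.mod missing"
    fi
done

rm -rf "$repo"
```

Running the build from the repository root with `--file` pointing at the Dockerfile keeps `go.mod` inside the context.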

I fixed this by using the following docker make target:

docker: ${DOCKERDIR}/${DOCKERFILE} docker-context
    echo `(env)`
    echo "Tag ${TAG}"
    echo "Version ${VERSION}"
    echo "CDP tag ${CDP_TAG}"
    echo "git describe $(shell git describe --tags --always --dirty)"
    if ! docker buildx ls | grep -q "zalando-builder"; then \
        docker buildx create --name zalando-builder; \
    fi;
    docker buildx build \
        --rm \
        --builder zalando-builder \
        --platform linux/arm64,linux/amd64 \
        --tag $(IMAGE):$(TAG)$(CDP_TAG)$(DEBUG_FRESH)$(DEBUG_POSTFIX) \
        --push \
        --file "${DOCKERDIR}/${DOCKERFILE}" \
        .

Which has resulted in this image (no guarantee of longevity of, or updates to, this image as we are currently only using it for a demo!).

While this worked for me, I have to say I am intrigued how you ended up getting yours building as I might be doing something too heavy handed. The image did take something like 20 minutes to build!

mmoscher commented 1 year ago

@abangser glad you found the solution yourself. As you mentioned, you've used the wrong docker-context to build the image.

The script I'm using to build multi-arch images is the following (it is located in another directory):

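# Work in a scratch directory so the clone can be deleted afterwards.
cd "/tmp"
echo "[INFO] Building postgresql operator ..."
git clone git@github.com:mmoscher/postgres-operator.git && pushd "postgres-operator"
git checkout arm64

# Build both architectures with the repository root (".") as the build
# context and push the result. Replace the quoted placeholder with your own
# repository and tag.
docker buildx build \
        --push \
        --platform=linux/amd64,linux/arm64 \
        -t "<private-repo-and-image-tag>" \
        -f docker/Dockerfile \
        .
popd
rm -rf postgres-operator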

However, I'm not yet using the makefile. @abangser would be awesome if you could file a PR with your change to my fork (https://github.com/mmoscher/postgres-operator/tree/arm64). Then we can work on from there and file a PR to this repo soon.

FYI: running this script on my Mac M1, using colima as the Docker backend, takes roughly 5 minutes to build the multi-arch images (base images cached). However, 20 minutes could be fine too, depending on your hardware.

mprimeaux commented 1 year ago

@abangser Is there anything I can do to help progress this item? I'm very keen to have a version of the operator that works on ARM64.

abangser commented 1 year ago

Absolutely appreciate it would be helpful. As I mentioned in this PR, it seems to be working for me and is as far as I can/will take a commit at this time as I am not aware of where else to go. Please feel free to merge or of course rewrite if it isn't quite right.

https://github.com/mmoscher/postgres-operator/pull/2#issuecomment-1544079462

Thanks

mprimeaux commented 1 year ago

Thanks! I will test this out today and reply on the PR and here.

jonizen commented 1 year ago

I think the latest release already includes these changes, but you have to specify the correct image.

Look at the latest release on the release page. I also think the other parts, for pooling and backup, are planned 😊

jonizen commented 1 year ago

Quoted from release page:

We are excited to announce a new release of the Postgres Operator. A rather small one but bringing you ARM support for the operator (pooler, ui and logical backup will follow). Thanks to everyone who contributed with PRs, feedback, raising issues or providing ideas.

New features

Provide Postgres-Operator as multi-arch image that can run on arm (#2268, #2127) .....

mprimeaux commented 1 year ago

@jonizen @abangser I can confirm the Postgres Operator successfully starts on ARM64 (Apple M1 and M2 CPUs) in Minikube with the QEMU driver. I ended up using the following stanza in the values file:

image:
  registry: ghcr.io
  repository: zalando/postgres-operator
  tag: v1.10.0
  pullPolicy: "IfNotPresent"

Is the intent to stick with GHCR for your container registry moving forward rather than defaulting to registry.opensource.zalan.do?

FxKu commented 1 year ago

Yes, we want to stick with ghcr for now. Do you see any problems with this @mprimeaux ? Just curious. This issue can be closed then, right?

mprimeaux commented 1 year ago

@FxKu No issues on my end at all. The reason I asked about GHCR was only because the values.yaml still refers to the previous registry, which I assume is for compatibility reasons (e.g. not all container images and versions are in GHCR yet).

I’ll close this issue. Thanks for all your help and support.

urashidmalik commented 11 months ago

Can we get the UI - postgres-operator-ui image for ARM64?