withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Force a JSON schema on the model output on the generation level
https://node-llama-cpp.withcat.ai
MIT License

When I use node-llama-cpp to run inference, Cloud Run fails with a 503 error #277

Open MarioSimou opened 1 month ago

MarioSimou commented 1 month ago

Issue description

When I use node-llama-cpp to run inference, Cloud Run fails with a 503 error.

Expected Behavior

Inference runs on Cloud Run without any issues.

Actual Behavior

I have a simple microservice that exposes two HTTP endpoints. One endpoint is used to check the health of the service (/api/v1/healthcheck), and the other endpoint is used to run inference (/api/v1/analyze) using node-llama-cpp and a Hugging Face model.
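For context, below is a minimal sketch of the kind of service described (not the code from the linked repo), assuming the node-llama-cpp v3 chat API and Node's built-in `http` module; the model path and request handling are illustrative:

```ts
// sketch.ts – a rough sketch of the two-endpoint service, for illustration only.
import {createServer} from "node:http";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "./models/model.gguf"}); // hypothetical path
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

createServer(async (req, res) => {
    if (req.url === "/api/v1/healthcheck") {
        res.writeHead(200).end("ok");
        return;
    }

    if (req.url === "/api/v1/analyze" && req.method === "POST") {
        let body = "";
        for await (const chunk of req) body += chunk;

        // Inference happens here – this is the call that crashes on Cloud Run
        // if the prebuilt llama.cpp binary uses unsupported CPU instructions.
        const answer = await session.prompt(body);

        res.writeHead(200, {"Content-Type": "application/json"});
        res.end(JSON.stringify({answer}));
        return;
    }

    res.writeHead(404).end();
}).listen(process.env.PORT ?? 8080);
```

The health check endpoint never touches the model, which matches the observation below that only the analyze endpoint fails.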

When I deployed the service on Google Cloud Run, I could access the health check endpoint without any issues. However, when I called the analyze endpoint, the service failed with a 503 error. Initially, I thought it was a configuration issue, so I tried all the steps mentioned here to fix it, but had no luck.

Next, I tested the container's behavior on a different cloud provider by deploying it on AWS ECS Fargate, but it failed there as well. At that point, I went back to the logs of the Cloud Run service and noticed that the container was terminating with the warning `Container terminated on signal 4`. Signal 4 is SIGILL (Illegal Instruction), which means the CPU was asked to execute an instruction that the hardware does not support.
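One way to confirm this is to log the CPU feature flags that the Cloud Run instance actually exposes and compare them against what the llama.cpp binary was compiled for. A rough sketch (Linux only, reads `/proc/cpuinfo`; the flag list is an assumed common set for x86-64 llama.cpp builds, not an exhaustive one):

```ts
// cpu-flags.ts – log which SIMD extensions the runtime CPU reports (Linux only).
// Run at container startup; if a flag the binary was built with is missing,
// that would explain SIGILL (signal 4).
import {readFileSync} from "node:fs";

const cpuinfo = readFileSync("/proc/cpuinfo", "utf8");
const flagsLine = cpuinfo.split("\n").find((line) => line.startsWith("flags"));
const flags = new Set(flagsLine?.split(":")[1]?.trim().split(/\s+/) ?? []);

// Flags commonly enabled in default x86-64 llama.cpp builds (assumed, for illustration)
for (const flag of ["sse3", "ssse3", "avx", "avx2", "fma", "f16c", "avx512f"]) {
    console.log(`${flag}: ${flags.has(flag) ? "supported" : "NOT supported"}`);
}
```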

Since I'm using node-llama-cpp to download and build the llama.cpp binaries, I suspect something in that build step produces code that is not aligned with what Cloud Run's CPUs support. I'm not sure how else to interpret this, and at this point, I'm exhausted.

Additional Notes:

  1. The Docker image uses the node:iron-bookworm-slim base image, which targets the amd64 architecture.
  2. The container works fine locally.
  3. Both node-llama-cpp v2 and v3 fail in Cloud Run.

Steps to reproduce

Repo

My Environment

| Dependency | Version |
| --- | --- |
| Operating System | Ubuntu Linux 20.04 |
| CPU | 12th Gen Intel i7-1260P |
| Node.js | 20.x |
| Typescript | 5.x |
| node-llama-cpp | 2.x and 3.x |

Additional Context

No response

Relevant Features Used

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

giladgd commented 1 month ago

I have a few suggestions for things you can try:

MarioSimou commented 1 month ago

I tried all the above cases, and none of them worked. However, while I was trying to create a repo for you to use, I noticed a couple of things:

So, the issue is definitely CPU-related.

I have also created the same service using the llama-cpp-python SDK, and I encountered the same problem there. So the issue doesn't appear to be specific to this repository, and I will be closing it soon. However, if you have any suggestions or ideas on how to solve it, feel free to share them with me.