withcatai / node-llama-cpp

Run AI models locally on your machine with node.js bindings for llama.cpp. Force a JSON schema on the model output on the generation level
https://node-llama-cpp.withcat.ai
MIT License

When I use node-llama-cpp to run inference, Cloud Run fails with a 503 error #277

Open MarioSimou opened 1 month ago

MarioSimou commented 1 month ago

Issue description

When I use node-llama-cpp to run inference, Cloud Run fails with a 503 error.

Expected Behavior

Inference runs on Cloud Run without any issues.

Actual Behavior

I have a simple microservice that exposes two HTTP endpoints. One endpoint is used to check the health of the service (/api/v1/healthcheck), and the other endpoint is used to run inference (/api/v1/analyze) using node-llama-cpp and a Hugging Face model.
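For context, below is a minimal sketch of the kind of service described (not the code from the linked repo), assuming the node-llama-cpp v3 chat API and Node's built-in `http` module; the model path and request handling are illustrative:

```ts
// sketch.ts – a rough sketch of the two-endpoint service, for illustration only.
import {createServer} from "node:http";
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "./models/model.gguf"}); // hypothetical path
const context = await model.createContext();
const session = new LlamaChatSession({contextSequence: context.getSequence()});

createServer(async (req, res) => {
    if (req.url === "/api/v1/healthcheck") {
        res.writeHead(200).end("ok");
        return;
    }

    if (req.url === "/api/v1/analyze" && req.method === "POST") {
        let body = "";
        for await (const chunk of req) body += chunk;

        // Inference happens here – this is the call that crashes on Cloud Run
        // if the prebuilt llama.cpp binary uses unsupported CPU instructions.
        const answer = await session.prompt(body);

        res.writeHead(200, {"Content-Type": "application/json"});
        res.end(JSON.stringify({answer}));
        return;
    }

    res.writeHead(404).end();
}).listen(process.env.PORT ?? 8080);
```

The health check endpoint never touches the model, which matches the observation below that only the analyze endpoint fails.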

When I deployed the service on Google Cloud Run, I could access the health check endpoint without any issues. However, when I called the analyze endpoint, the service failed with a 503 error. Initially, I thought it was a configuration issue, so I tried all the steps mentioned here to fix it, but had no luck.

Next, I tested the container's behavior on a different cloud provider by deploying it on AWS ECS Fargate, but it failed there as well. At that point, I went back to the logs of the Cloud Run service and noticed that the container was terminating with the warning `Container terminated on signal 4`. Signal 4 is SIGILL (Illegal Instruction), which means the CPU was asked to execute an instruction that the hardware does not support.
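One way to confirm this is to log the CPU feature flags that the Cloud Run instance actually exposes and compare them against what the llama.cpp binary was compiled for. A rough sketch (Linux only, reads `/proc/cpuinfo`; the flag list is an assumed common set for x86-64 llama.cpp builds, not an exhaustive one):

```ts
// cpu-flags.ts – log which SIMD extensions the runtime CPU reports (Linux only).
// Run at container startup; if a flag the binary was built with is missing,
// that would explain SIGILL (signal 4).
import {readFileSync} from "node:fs";

const cpuinfo = readFileSync("/proc/cpuinfo", "utf8");
const flagsLine = cpuinfo.split("\n").find((line) => line.startsWith("flags"));
const flags = new Set(flagsLine?.split(":")[1]?.trim().split(/\s+/) ?? []);

// Flags commonly enabled in default x86-64 llama.cpp builds (assumed, for illustration)
for (const flag of ["sse3", "ssse3", "avx", "avx2", "fma", "f16c", "avx512f"]) {
    console.log(`${flag}: ${flags.has(flag) ? "supported" : "NOT supported"}`);
}
```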

Since I'm using node-llama-cpp to download and build the llama.cpp binaries, I suspect something in that build step produces code that is not aligned with what Cloud Run's CPUs support. I'm not sure how else to interpret this, and at this point, I'm exhausted.

Additional Notes:

  1. The Docker image uses the node:iron-bookworm-slim base image, which targets the amd64 architecture.
  2. The container works fine locally.
  3. Both node-llama-cpp v2 and v3 fail in Cloud Run.

Steps to reproduce

Repo

My Environment

| Dependency | Version |
| --- | --- |
| Operating System | Ubuntu Linux 20.04 |
| CPU | 12th Gen Intel i7-1260P |
| Node.js | 20.x |
| Typescript | 5.x |
| node-llama-cpp | 2.x and 3.x |

Additional Context

No response

Relevant Features Used

Are you willing to resolve this issue by submitting a Pull Request?

Yes, I have the time, but I don't know how to start. I would need guidance.

giladgd commented 1 month ago

I have a few suggestions for things you can try:

MarioSimou commented 1 month ago

I tried all the above cases, and none of them worked. However, while I was trying to create a repo for you to use, I noticed a couple of things:

So, the issue is definitely CPU-related.

I have also created the same service using the llama-cpp-python SDK, and I encountered the same problem there. So the issue doesn't appear to be specific to this repository, and I will be closing it soon. However, if you have any suggestions or ideas on how to solve it, feel free to share them with me.