Closed: jansol closed this issue 2 months ago
On macOS (64 GB M3 Max) this works for me as expected. On Linux (64 GB, no GPU) this gives the following error:
level=WARN source=server.go:134 msg="model request too large for system" requested="105.2 GiB" available=67106545664 total="62.5 GiB" free="54.5 GiB" swap="8.0 GiB"
level=INFO source=sched.go:429 msg="NewLlamaServer failed" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe error="model requires more system memory (105.2 GiB) than is available (62.5 GiB)"
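For context, the check behind that log line amounts to comparing ollama's memory estimate for the requested context size against what the host reports. A minimal sketch of that kind of gate, using the figures from the log above (illustrative only, not ollama's actual code):

package main

import "fmt"

// fitsInSystemMemory mirrors the kind of pre-load gate seen in the log above:
// refuse to start the runner when the estimated requirement exceeds total
// system memory. Purely illustrative; not ollama's actual implementation.
func fitsInSystemMemory(requiredGiB, totalGiB float64) error {
	if requiredGiB > totalGiB {
		return fmt.Errorf("model requires more system memory (%.1f GiB) than is available (%.1f GiB)",
			requiredGiB, totalGiB)
	}
	return nil
}

func main() {
	// Figures taken from the log: 105.2 GiB requested vs 62.5 GiB total.
	fmt.Println(fitsInSystemMemory(105.2, 62.5))
}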
Can you check your logs (journalctl -u ollama) and paste what you find (I assume you'll see something similar)?
I'm going to take a crack at
In the meantime, I think I'll tone down the defaults that were introduced in:
so the defaults fit in something more commonly available (8/16 GB) but are still potentially higher than the 2048 default where appropriate. Alternatively I might just revert.
Thanks for reporting.
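As a rough sketch of the policy described above, a default picker could scale the context window with system memory, bottoming out at the old 2048 default. The thresholds here are illustrative assumptions, not the values that actually shipped:

package main

import "fmt"

// defaultNumCtx sketches the policy described above: keep the default
// context window modest on common 8/16 GB machines, but allow something
// higher than 2048 when there is clearly room. Thresholds are assumptions
// for illustration only.
func defaultNumCtx(systemGiB float64) int {
	switch {
	case systemGiB >= 32:
		return 16384
	case systemGiB >= 16:
		return 8192
	case systemGiB >= 8:
		return 4096
	default:
		return 2048
	}
}

func main() {
	for _, gib := range []float64{8, 16, 32, 64} {
		fmt.Printf("%3.0f GiB RAM -> num_ctx %d\n", gib, defaultNumCtx(gib))
	}
}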
macOS, 16 GB M2.
Zed threw this error after entering a small prompt with no context.
It doesn't throw an error for me, it just segfaults:
ollama[1884]: ggml_cuda_init: found 1 ROCm devices:
ollama[1884]: Device 0: Radeon Vega Frontier Edition, compute capability 9.0, VMM: no
ollama[1884]: llm_load_tensors: ggml ctx size = 0.27 MiB
ollama[1884]: llm_load_tensors: offloading 8 repeating layers to GPU
ollama[1884]: llm_load_tensors: offloaded 8/33 layers to GPU
ollama[1884]: llm_load_tensors: ROCm0 buffer size = 936.25 MiB
ollama[1884]: llm_load_tensors: ROCm_Host buffer size = 3501.56 MiB
kernel: ollama_llama_se[104303]: segfault at 0 ip 00007e2941fea850 sp 00007ffd3cb380b8 error 4 in libggml.so[7e2941f93000+225000] likely on CPU 9 (core 4, socket 0)
kernel: Code: f8 ab e8 48 8d 15 bc 16 ab e8 48 8d 0d be 08 ac e8 be 7e 01 00 00 31 c0 e8 2d b8 1c 00 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 <48> 8b 07 48 8b 80 a0 00 00 00 48 85 c0 74 02 ff e0 50 48 8d 3d f4
ollama[1884]: time=2024-08-29T10:57:41.489+03:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
ollama[1884]: time=2024-08-29T10:57:41.739+03:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
ollama[1884]: [GIN] 2024/08/29 - 10:57:41 | 500 | 3.065869135s | 127.0.0.1 | POST "/api/chat"
Ah, ollama run also crashes if I do /set parameter num_ctx 131072. If I halve it to 65536 it works fine:
ollama[1884]: ggml_cuda_init: found 1 ROCm devices:
ollama[1884]: Device 0: Radeon Vega Frontier Edition, compute capability 9.0, VMM: no
ollama[1884]: llm_load_tensors: ggml ctx size = 0.27 MiB
ollama[1884]: llm_load_tensors: offloading 27 repeating layers to GPU
ollama[1884]: llm_load_tensors: offloaded 27/33 layers to GPU
ollama[1884]: llm_load_tensors: ROCm0 buffer size = 3159.85 MiB
ollama[1884]: llm_load_tensors: CPU buffer size = 4437.80 MiB
ollama[1884]: llama_new_context_with_model: n_ctx = 65536
ollama[1884]: llama_new_context_with_model: n_batch = 512
ollama[1884]: llama_new_context_with_model: n_ubatch = 512
ollama[1884]: llama_new_context_with_model: flash_attn = 0
ollama[1884]: llama_new_context_with_model: freq_base = 500000.0
ollama[1884]: llama_new_context_with_model: freq_scale = 1
ollama[1884]: llama_kv_cache_init: ROCm0 KV buffer size = 6912.00 MiB
ollama[1884]: llama_kv_cache_init: ROCm_Host KV buffer size = 1280.00 MiB
ollama[1884]: llama_new_context_with_model: KV self size = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
ollama[1884]: llama_new_context_with_model: ROCm_Host output buffer size = 0.50 MiB
ollama[1884]: llama_new_context_with_model: ROCm0 compute buffer size = 4504.00 MiB
ollama[1884]: llama_new_context_with_model: ROCm_Host compute buffer size = 136.01 MiB
ollama[1884]: llama_new_context_with_model: graph nodes = 1030
ollama[1884]: llama_new_context_with_model: graph splits = 69
ollama[111909]: INFO [main] model loaded | tid="129528388342592" timestamp=1724919052
ollama[1884]: time=2024-08-29T11:10:52.385+03:00 level=INFO source=server.go:630 msg="llama runner started in 4.63 seconds"
32GiB system RAM, 16GiB GPU VRAM
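For reference, the "KV self size = 8192.00 MiB" line above follows directly from the commonly published llama3.1 8B shape (32 layers, 8 KV heads, head dim 128) with an f16 cache. A quick sketch, treating those parameters as assumptions:

package main

import "fmt"

// KV cache size for a GQA model with an f16 cache:
//   bytes = 2 (K and V) * layers * n_ctx * kv_heads * head_dim * 2 bytes (f16)
// The parameters below are the commonly published llama3.1 8B values
// (32 layers, 8 KV heads, head dim 128); treat them as assumptions.
func kvCacheMiB(nCtx int) float64 {
	const (
		layers   = 32
		kvHeads  = 8
		headDim  = 128
		f16Bytes = 2
	)
	bytes := 2.0 * layers * float64(nCtx) * kvHeads * headDim * f16Bytes
	return bytes / (1 << 20)
}

func main() {
	// 65536 reproduces the 8192.00 MiB "KV self size" in the log above;
	// 131072 shows how doubling num_ctx doubles the cache again.
	for _, ctx := range []int{2048, 16384, 65536, 131072} {
		fmt.Printf("num_ctx %6d -> KV cache %8.2f MiB\n", ctx, kvCacheMiB(ctx))
	}
}

By this estimate, doubling num_ctx to 131072 doubles the cache to 16 GiB, which already matches this GPU's 16 GiB of VRAM before weights and compute buffers are counted.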
EDIT: never mind, building Zed with the limit set to 65536 and actually trying to prompt it with a code file makes ollama throw CUDA error: out of memory, which I'm not sure how it's getting passed to Zed, but it shows up as "Unable to parse chat response".
The fix is now available in Zed Preview. By default we are now way less aggressive about model size and will never do anything >16384 out of the box.
Can you update your Zed Preview and see whether this fixes your issues with llama3.1:latest crashing?
If you have the hardware to handle 65536, this is now supported via settings.json:
{
  "language_models": {
    "ollama": {
      "available_models": [
        {
          "provider": "ollama",
          "name": "llama3.1:latest",
          "max_tokens": 65536
        }
      ]
    }
  }
}
I've also updated the Zed Ollama Configuration Docs. Thanks for reporting!
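Roughly, the behaviour described in this comment comes down to: use an explicit available_models override when one is present, otherwise clamp the out-of-the-box value to 16384. A sketch of that rule (an illustration of the described policy, not Zed's actual code):

package main

import "fmt"

// effectiveMaxTokens sketches the behaviour described above: use the
// max_tokens from an explicit available_models entry in settings.json when
// one is given, otherwise clamp the out-of-the-box value to 16384.
// Illustrative only.
func effectiveMaxTokens(override, modelDefault int) int {
	if override > 0 {
		return override // user opted in via settings.json
	}
	if modelDefault > 16384 {
		return 16384
	}
	return modelDefault
}

func main() {
	fmt.Println(effectiveMaxTokens(0, 131072))     // out of the box -> 16384
	fmt.Println(effectiveMaxTokens(65536, 131072)) // settings.json override -> 65536
	fmt.Println(effectiveMaxTokens(0, 2048))       // small default stays as-is -> 2048
}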
Yep, with the 16k default it works fine ootb! And with 65536 it seems to work too, although it's a bit slow since the total memory use of the model ends up being around 18GB so it doesn't fit entirely on the GPU. 32492 looks like a good compromise at 11GB.
Check for existing issues
Describe the bug / provide steps to reproduce it
After #16877 any prompt to ollama with llama3.1:latest crashes the ollama runner (ROCm). This does not happen with ollama run llama3.1:latest, which uses the default token limit of 2048.
Environment
Zed: v0.151.0 (Zed Dev a5b82b2bf3b54ec210bb293cf541eb4c6164824b)
OS: Linux Wayland ubuntu 24.04
Memory: 31.3 GiB
Architecture: x86_64
GPU: Radeon Vega Frontier Edition (RADV VEGA10) || radv || Mesa 24.0.9-0ubuntu0.1
If applicable, add mockups / screenshots to help explain / present your vision of the feature
No response
If applicable, attach your Zed.log file to this issue.
No response