Closed: jansol closed this issue 2 months ago
On macOS (64 GB M3 Max) this works for me as expected. On Linux (64 GB, no GPU) this gives the following error:
level=WARN source=server.go:134 msg="model request too large for system" requested="105.2 GiB" available=67106545664 total="62.5 GiB" free="54.5 GiB" swap="8.0 GiB"
level=INFO source=sched.go:429 msg="NewLlamaServer failed" model=/usr/share/ollama/.ollama/models/blobs/sha256-8eeb52dfb3bb9aefdf9d1ef24b3bdbcfbe82238798c4b918278320b6fcef18fe error="model requires more system memory (105.2 GiB) than is available (62.5 GiB)"
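For context, the check behind that log line amounts to comparing ollama's memory estimate for the requested context size against what the host reports. A minimal sketch of that kind of gate, using the figures from the log above (illustrative only, not ollama's actual code):

package main

import "fmt"

// fitsInSystemMemory mirrors the kind of pre-load gate seen in the log above:
// refuse to start the runner when the estimated requirement exceeds total
// system memory. Purely illustrative; not ollama's actual implementation.
func fitsInSystemMemory(requiredGiB, totalGiB float64) error {
	if requiredGiB > totalGiB {
		return fmt.Errorf("model requires more system memory (%.1f GiB) than is available (%.1f GiB)",
			requiredGiB, totalGiB)
	}
	return nil
}

func main() {
	// Figures taken from the log: 105.2 GiB requested vs 62.5 GiB total.
	fmt.Println(fitsInSystemMemory(105.2, 62.5))
}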
Can you check your logs (journalctl -u ollama) and paste what you find (I assume you'll see something similar)?
I'm going to take a crack at
In the meantime, I think I'll tone down the defaults that were introduced in:
so the defaults fit in something more commonly available (8/16 GB) but are still potentially higher than the 2048 default where appropriate. Alternatively I might just revert.
Thanks for reporting.
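As a rough sketch of the policy described above, a default picker could scale the context window with system memory, bottoming out at the old 2048 default. The thresholds here are illustrative assumptions, not the values that actually shipped:

package main

import "fmt"

// defaultNumCtx sketches the policy described above: keep the default
// context window modest on common 8/16 GB machines, but allow something
// higher than 2048 when there is clearly room. Thresholds are assumptions
// for illustration only.
func defaultNumCtx(systemGiB float64) int {
	switch {
	case systemGiB >= 32:
		return 16384
	case systemGiB >= 16:
		return 8192
	case systemGiB >= 8:
		return 4096
	default:
		return 2048
	}
}

func main() {
	for _, gib := range []float64{8, 16, 32, 64} {
		fmt.Printf("%3.0f GiB RAM -> num_ctx %d\n", gib, defaultNumCtx(gib))
	}
}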
macOS, 16 GB M2.
Zed threw this error after entering a small prompt with no context.
It doesn't throw an error for me, it just segfaults:
ollama[1884]: ggml_cuda_init: found 1 ROCm devices:
ollama[1884]: Device 0: Radeon Vega Frontier Edition, compute capability 9.0, VMM: no
ollama[1884]: llm_load_tensors: ggml ctx size = 0.27 MiB
ollama[1884]: llm_load_tensors: offloading 8 repeating layers to GPU
ollama[1884]: llm_load_tensors: offloaded 8/33 layers to GPU
ollama[1884]: llm_load_tensors: ROCm0 buffer size = 936.25 MiB
ollama[1884]: llm_load_tensors: ROCm_Host buffer size = 3501.56 MiB
kernel: ollama_llama_se[104303]: segfault at 0 ip 00007e2941fea850 sp 00007ffd3cb380b8 error 4 in libggml.so[7e2941f93000+225000] likely on CPU 9 (core 4, socket 0)
kernel: Code: f8 ab e8 48 8d 15 bc 16 ab e8 48 8d 0d be 08 ac e8 be 7e 01 00 00 31 c0 e8 2d b8 1c 00 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 <48> 8b 07 48 8b 80 a0 00 00 00 48 85 c0 74 02 ff e0 50 48 8d 3d f4
ollama[1884]: time=2024-08-29T10:57:41.489+03:00 level=INFO source=server.go:625 msg="waiting for server to become available" status="llm server error"
ollama[1884]: time=2024-08-29T10:57:41.739+03:00 level=ERROR source=sched.go:456 msg="error loading llama server" error="llama runner process has terminated: signal: segmentation fault (core dumped)"
ollama[1884]: [GIN] 2024/08/29 - 10:57:41 | 500 | 3.065869135s | 127.0.0.1 | POST "/api/chat"
Ah, ollama run also crashes if I do /set parameter num_ctx 131072. If I halve it to 65536 it works fine:
ollama[1884]: ggml_cuda_init: found 1 ROCm devices:
ollama[1884]: Device 0: Radeon Vega Frontier Edition, compute capability 9.0, VMM: no
ollama[1884]: llm_load_tensors: ggml ctx size = 0.27 MiB
ollama[1884]: llm_load_tensors: offloading 27 repeating layers to GPU
ollama[1884]: llm_load_tensors: offloaded 27/33 layers to GPU
ollama[1884]: llm_load_tensors: ROCm0 buffer size = 3159.85 MiB
ollama[1884]: llm_load_tensors: CPU buffer size = 4437.80 MiB
ollama[1884]: llama_new_context_with_model: n_ctx = 65536
ollama[1884]: llama_new_context_with_model: n_batch = 512
ollama[1884]: llama_new_context_with_model: n_ubatch = 512
ollama[1884]: llama_new_context_with_model: flash_attn = 0
ollama[1884]: llama_new_context_with_model: freq_base = 500000.0
ollama[1884]: llama_new_context_with_model: freq_scale = 1
ollama[1884]: llama_kv_cache_init: ROCm0 KV buffer size = 6912.00 MiB
ollama[1884]: llama_kv_cache_init: ROCm_Host KV buffer size = 1280.00 MiB
ollama[1884]: llama_new_context_with_model: KV self size = 8192.00 MiB, K (f16): 4096.00 MiB, V (f16): 4096.00 MiB
ollama[1884]: llama_new_context_with_model: ROCm_Host output buffer size = 0.50 MiB
ollama[1884]: llama_new_context_with_model: ROCm0 compute buffer size = 4504.00 MiB
ollama[1884]: llama_new_context_with_model: ROCm_Host compute buffer size = 136.01 MiB
ollama[1884]: llama_new_context_with_model: graph nodes = 1030
ollama[1884]: llama_new_context_with_model: graph splits = 69
ollama[111909]: INFO [main] model loaded | tid="129528388342592" timestamp=1724919052
ollama[1884]: time=2024-08-29T11:10:52.385+03:00 level=INFO source=server.go:630 msg="llama runner started in 4.63 seconds"
32GiB system RAM, 16GiB GPU VRAM
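For reference, the "KV self size = 8192.00 MiB" line above follows directly from the commonly published llama3.1 8B shape (32 layers, 8 KV heads, head dim 128) with an f16 cache. A quick sketch, treating those parameters as assumptions:

package main

import "fmt"

// KV cache size for a GQA model with an f16 cache:
//   bytes = 2 (K and V) * layers * n_ctx * kv_heads * head_dim * 2 bytes (f16)
// The parameters below are the commonly published llama3.1 8B values
// (32 layers, 8 KV heads, head dim 128); treat them as assumptions.
func kvCacheMiB(nCtx int) float64 {
	const (
		layers   = 32
		kvHeads  = 8
		headDim  = 128
		f16Bytes = 2
	)
	bytes := 2.0 * layers * float64(nCtx) * kvHeads * headDim * f16Bytes
	return bytes / (1 << 20)
}

func main() {
	// 65536 reproduces the 8192.00 MiB "KV self size" in the log above;
	// 131072 shows how doubling num_ctx doubles the cache again.
	for _, ctx := range []int{2048, 16384, 65536, 131072} {
		fmt.Printf("num_ctx %6d -> KV cache %8.2f MiB\n", ctx, kvCacheMiB(ctx))
	}
}

By this estimate, doubling num_ctx to 131072 doubles the cache to 16 GiB, which already matches this GPU's 16 GiB of VRAM before weights and compute buffers are counted.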
EDIT: never mind, building Zed with the limit set to 65536 and actually trying to prompt it with a code file makes ollama throw CUDA error: out of memory, which I'm not sure how it's getting passed to Zed, but it shows up as "Unable to parse chat response".
The fix is now available in Zed Preview. By default we are now way less aggressive about model size and will never do anything >16384 out of the box.
Can you update your Zed Preview and see whether this fixes your issues with llama3.1:latest crashing?
If you have the hardware to handle 65536, this is now supported via settings.json:
{
  "language_models": {
    "ollama": {
      "available_models": [
        {
          "provider": "ollama",
          "name": "llama3.1:latest",
          "max_tokens": 65536
        }
      ]
    }
  }
}
I've also updated the Zed Ollama Configuration Docs. Thanks for reporting!
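Roughly, the behaviour described in this comment comes down to: use an explicit available_models override when one is present, otherwise clamp the out-of-the-box value to 16384. A sketch of that rule (an illustration of the described policy, not Zed's actual code):

package main

import "fmt"

// effectiveMaxTokens sketches the behaviour described above: use the
// max_tokens from an explicit available_models entry in settings.json when
// one is given, otherwise clamp the out-of-the-box value to 16384.
// Illustrative only.
func effectiveMaxTokens(override, modelDefault int) int {
	if override > 0 {
		return override // user opted in via settings.json
	}
	if modelDefault > 16384 {
		return 16384
	}
	return modelDefault
}

func main() {
	fmt.Println(effectiveMaxTokens(0, 131072))     // out of the box -> 16384
	fmt.Println(effectiveMaxTokens(65536, 131072)) // settings.json override -> 65536
	fmt.Println(effectiveMaxTokens(0, 2048))       // small default stays as-is -> 2048
}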
Yep, with the 16k default it works fine ootb! And with 65536 it seems to work too, although it's a bit slow since the total memory use of the model ends up being around 18GB so it doesn't fit entirely on the GPU. 32492 looks like a good compromise at 11GB.
Check for existing issues
Describe the bug / provide steps to reproduce it
After #16877 any prompt to ollama with llama3.1:latest crashes the ollama runner (ROCm). This does not happen with ollama run llama3.1:latest, which uses the default token limit of 2048.
Environment
Zed: v0.151.0 (Zed Dev a5b82b2bf3b54ec210bb293cf541eb4c6164824b)
OS: Linux Wayland ubuntu 24.04
Memory: 31.3 GiB
Architecture: x86_64
GPU: Radeon Vega Frontier Edition (RADV VEGA10) || radv || Mesa 24.0.9-0ubuntu0.1
If applicable, add mockups / screenshots to help explain / present your vision of the feature
No response
If applicable, attach your Zed.log file to this issue.
No response