Open zhangir-azerbayev opened 1 year ago
Just wondering if the "one tactic suggestion per" is a restriction of llama.cpp? Is it possible to use beam search or other decoding algorithms to get multiple tactics?
Based on this section of the llama.cpp docs, it appears to be a restriction of llama.cpp. That said, I suspect it's less a restriction of llama.cpp and more a fundamental limitation of CPU inference, since CPUs probably don't have the FLOPs for batch-level parallelism.
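In the meantime, one workaround could be to sample several completions sequentially rather than in a batch. Here is a rough sketch, assuming the llama.cpp example server is running locally and exposes its /completion endpoint; the port, endpoint name, and response fields are assumptions about that server, not part of llmstep's code:

```python
# Hypothetical workaround (not part of this PR): collect several tactic
# suggestions by issuing sequential sampled requests to a local llama.cpp
# server. Assumes the llama.cpp example server's /completion endpoint;
# field names and the port may differ between versions.
import requests

LLAMA_URL = "http://localhost:8080/completion"  # assumed default port

def suggest_tactics(prompt: str, n: int = 3, temperature: float = 0.7) -> list[str]:
    suggestions = []
    for _ in range(n):
        resp = requests.post(LLAMA_URL, json={
            "prompt": prompt,
            "n_predict": 64,        # cap the length of a single tactic
            "temperature": temperature,
            "stop": ["\n"],         # a suggestion is a single tactic line
        })
        resp.raise_for_status()
        text = resp.json().get("content", "").strip()
        if text and text not in suggestions:  # drop duplicate samples
            suggestions.append(text)
    return suggestions
```

Sequential sampling keeps memory flat and sidesteps batch decoding entirely, at the cost of roughly n times the latency.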
I trained a new tactic generator for 15 epochs on the llmstep training set. In principle, this model should be competitive with the llmstep model, but I haven't extensively checked the training code for bugs that may be impacting performance.
I got this running with the v0.3 model, quantized as q4_0. Unfortunately it is rather unsuccessful on Examples.lean.
I only saw one server crash, a segfault.
This PR adds support for an inference server powered by llama.cpp that runs efficiently on CPU. I am able to get a tactic suggestion in a few seconds on my five-year-old ThinkPad.
See the additions to the README for installation and configuration instructions.
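For illustration, here is a minimal sketch of what a CPU suggestion server built on llama.cpp bindings could look like; it is not the PR's actual implementation. It assumes the llama-cpp-python package and an llmstep-style [GOAL]...[PROOFSTEP] prompt; the model path, port, and request schema are placeholders:

```python
# Minimal sketch of a CPU tactic-suggestion server built on the
# llama-cpp-python bindings; NOT the PR's actual server code.
# Model path, port, prompt format, and request schema are placeholders.
from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)
llm = Llama(model_path="open-llama-3b_next-tactic_dev0.2.q4_0.bin", n_ctx=2048)

@app.route("/", methods=["POST"])
def suggest():
    tactic_state = request.json["tactic_state"]
    # llmstep-style prompt; the exact format is an assumption here
    prompt = f"[GOAL]{tactic_state}[PROOFSTEP]"
    out = llm(prompt, max_tokens=64, temperature=0.0, stop=["\n"])
    return jsonify({"suggestions": [out["choices"][0]["text"].strip()]})

if __name__ == "__main__":
    app.run(port=5000)
```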
Unfortunately, the GPTNeoX architecture of the official llmstep model isn't supported by the llama.cpp library. I threw together a quick finetune of open_llama_3b_v2, which you can find here: open-llama-3b_next-tactic_dev0.2. Note this model is extremely undertrained (less than one A100-hour), so it should be viewed only as a proof of concept that inference latency on CPU can be acceptable. Training an open-llama that matches the llmstep model will be quite trivial.
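For reference, a rough sketch of what such a finetune could look like with the Hugging Face Trainer is below; this is not the actual training code, and the dataset path, field names, prompt format, and hyperparameters are placeholders:

```python
# Rough sketch of the finetune; NOT the actual python/train/tune.py.
# Dataset path, field names ("state", "next_tactic"), prompt format, and
# hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "openlm-research/open_llama_3b_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

def to_text(ex):
    # llmstep-style (state, next-tactic) pair; the format is an assumption
    return {"text": f"[GOAL]{ex['state']}[PROOFSTEP]{ex['next_tactic']}"}

ds = load_dataset("json", data_files="train.jsonl")["train"]  # placeholder path
ds = ds.map(to_text)
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="open-llama-3b_next-tactic",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```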
A substantial refactor of python/train/tune.py was required to train the open llama, but that code is still incredibly messy and I will PR it once I clean it up a bit.

The reasons this PR is a draft are:
- The server can crash while interacting with the llmstep "" call. For now it is best to not move your cursor after typing the empty string, and actually putting text in the string almost always causes a crash.
- Only one tactic suggestion is returned per llmstep call. The CPU won't handle parallel decoding well like the GPU can, but we might want to add a "get another suggestion" button or something of the sort.