Open zhangir-azerbayev opened 1 year ago
Just wondering if the "one tactic suggestion per" is a restriction of llama.cpp? Is it possible to use beam search or other decoding algorithms to get multiple tactics?
Based on this section of the llama.cpp docs, it appears to be a restriction of llama.cpp. That said, I suspect it's less a restriction of llama.cpp and more a fundamental limitation of CPU inference, since CPUs probably don't have the FLOPs for batch-level parallelism.
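In the meantime, one workaround could be to sample several completions sequentially rather than in a batch. Here is a rough sketch, assuming the llama.cpp example server is running locally and exposes its /completion endpoint; the port, endpoint name, and response fields are assumptions about that server, not part of llmstep's code:

```python
# Hypothetical workaround (not part of this PR): collect several tactic
# suggestions by issuing sequential sampled requests to a local llama.cpp
# server. Assumes the llama.cpp example server's /completion endpoint;
# field names and the port may differ between versions.
import requests

LLAMA_URL = "http://localhost:8080/completion"  # assumed default port

def suggest_tactics(prompt: str, n: int = 3, temperature: float = 0.7) -> list[str]:
    suggestions = []
    for _ in range(n):
        resp = requests.post(LLAMA_URL, json={
            "prompt": prompt,
            "n_predict": 64,        # cap the length of a single tactic
            "temperature": temperature,
            "stop": ["\n"],         # a suggestion is a single tactic line
        })
        resp.raise_for_status()
        text = resp.json().get("content", "").strip()
        if text and text not in suggestions:  # drop duplicate samples
            suggestions.append(text)
    return suggestions
```

Sequential sampling keeps memory flat and sidesteps batch decoding entirely, at the cost of roughly n times the latency.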
I trained a new tactic generator for 15 epochs on the llmstep training set. In principle, this model should be competitive with the llmstep model, but I haven't extensively checked the training code for bugs that may be impacting performance.
I got this running with the v0.3 model, quantized as q4_0. Unfortunately it is rather unsuccessful on Examples.lean.
I only saw one server crash, a segfault.
This PR adds support for an inference server powered by llama.cpp that runs efficiently on CPU. I am able to get a tactic suggestion in a few seconds on my five-year-old ThinkPad.
See the additions to the README for installation and configuration instructions.
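For illustration, here is a minimal sketch of what a CPU suggestion server built on llama.cpp bindings could look like; it is not the PR's actual implementation. It assumes the llama-cpp-python package and an llmstep-style [GOAL]...[PROOFSTEP] prompt; the model path, port, and request schema are placeholders:

```python
# Minimal sketch of a CPU tactic-suggestion server built on the
# llama-cpp-python bindings; NOT the PR's actual server code.
# Model path, port, prompt format, and request schema are placeholders.
from flask import Flask, request, jsonify
from llama_cpp import Llama

app = Flask(__name__)
llm = Llama(model_path="open-llama-3b_next-tactic_dev0.2.q4_0.bin", n_ctx=2048)

@app.route("/", methods=["POST"])
def suggest():
    tactic_state = request.json["tactic_state"]
    # llmstep-style prompt; the exact format is an assumption here
    prompt = f"[GOAL]{tactic_state}[PROOFSTEP]"
    out = llm(prompt, max_tokens=64, temperature=0.0, stop=["\n"])
    return jsonify({"suggestions": [out["choices"][0]["text"].strip()]})

if __name__ == "__main__":
    app.run(port=5000)
```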
Unfortunately, the GPTNeoX architecture of the official llmstep model isn't supported by the llama.cpp library. I threw together a quick finetune of open_llama_3b_v2, which you can find here: open-llama-3b_next-tactic_dev0.2. Note this model is extremely undertrained (less than one A100-hour), so it should be viewed only as a proof of concept that inference latency on CPU can be acceptable. Training an open-llama that matches the llmstep model will be quite trivial.
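For reference, a rough sketch of what such a finetune could look like with the Hugging Face Trainer is below; this is not the actual training code, and the dataset path, field names, prompt format, and hyperparameters are placeholders:

```python
# Rough sketch of the finetune; NOT the actual python/train/tune.py.
# Dataset path, field names ("state", "next_tactic"), prompt format, and
# hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

model_name = "openlm-research/open_llama_3b_v2"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers have no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

def to_text(ex):
    # llmstep-style (state, next-tactic) pair; the format is an assumption
    return {"text": f"[GOAL]{ex['state']}[PROOFSTEP]{ex['next_tactic']}"}

ds = load_dataset("json", data_files="train.jsonl")["train"]  # placeholder path
ds = ds.map(to_text)
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="open-llama-3b_next-tactic",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```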
A substantial refactor of python/train/tune.py was required to train the open llama, but that code is still incredibly messy and I will PR it once I clean it up a bit.

The reasons this PR is a draft are:
- The server can crash while interacting with the llmstep "" call. For now it is best to not move your cursor after typing the empty string, and actually putting text in the string almost always causes a crash.
- Only one tactic suggestion is returned per llmstep call. The CPU won't handle parallel decoding well like the GPU can, but we might want to add a "get another suggestion" button or something of the sort.