We're seeing different accuracies depending on whether we use the huggingface or vllm inference backend. Here are chrF evaluation scores for the chatbot example when using
huggingface:
eval=0.19579241963569372 param={"context_length": 4, "max_tokens": 100, "model_preset": "vicuna-7b", "prompt_preset": "standard", "temperature": 0.3, "top_p": 1.0}
vllm:
eval=0.12566133684961223 param={"context_length": 4, "max_tokens": 100, "model_preset": "vicuna-7b", "prompt_preset": "standard", "temperature": 0.3, "top_p": 1.0}
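For reference, here is a rough sketch of how these generation parameters typically map onto each backend. This is an assumed illustration, not the actual zeno-build internals, and the checkpoint name "lmsys/vicuna-7b-v1.3" is a guess at what the "vicuna-7b" preset resolves to:

```python
# Hedged sketch: how temperature/top_p/max_tokens are usually passed to each backend.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "lmsys/vicuna-7b-v1.3"  # assumed checkpoint for the "vicuna-7b" preset
prompt = "..."  # prompt built from the last `context_length` turns

# huggingface backend: sampled generation via model.generate
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
inputs = tok(prompt, return_tensors="pt")
out = model.generate(
    **inputs, do_sample=True, temperature=0.3, top_p=1.0, max_new_tokens=100
)
hf_text = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# vllm backend: same parameters through SamplingParams
llm = LLM(model=MODEL)
vllm_text = llm.generate(
    [prompt], SamplingParams(temperature=0.3, top_p=1.0, max_tokens=100)
)[0].outputs[0].text
```

Note that with temperature=0.3 both backends sample stochastically, so some run-to-run variation within a single backend is expected as well.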
Commands to reproduce the result (after #161 is merged) are as follows:
python -u -m examples.chatbot.main --results-dir results_huggingface --models vicuna-7b --single-model vicuna-7b --prompts standard --single-prompt standard --experiments prompt --hf-inference-method huggingface
python -u -m examples.chatbot.main --results-dir results_vllm --models vicuna-7b --single-model vicuna-7b --prompts standard --single-prompt standard --experiments prompt --hf-inference-method vllm
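If it helps with triage, the gap can also be re-checked offline by re-scoring the saved outputs with sacrebleu's chrF. This is only a sketch under assumed file names (outputs.json / references.json are hypothetical, not the actual files the example writes):

```python
# Hedged sketch: re-score both runs with chrF, assuming each results dir holds
# a JSON list of output strings and the references live in a shared JSON file.
import json
from sacrebleu.metrics import CHRF

def score(outputs_path: str, references_path: str) -> float:
    with open(outputs_path) as f:
        hypotheses = json.load(f)  # list[str] of model outputs
    with open(references_path) as f:
        references = json.load(f)  # list[str] of gold responses
    chrf = CHRF()
    # sacrebleu expects a list of reference streams, hence the extra nesting;
    # divide by 100 to match the 0-1 scale of the eval numbers above.
    return chrf.corpus_score(hypotheses, [references]).score / 100.0

print("huggingface:", score("results_huggingface/outputs.json", "data/references.json"))
print("vllm:", score("results_vllm/outputs.json", "data/references.json"))
```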