Examine accuracy differences between the huggingface and vllm backends #162

Closed: neubig closed this 8 months ago

neubig commented 1 year ago

We're seeing different accuracies depending on whether we use the huggingface or vllm inference backend. Here are chrF evaluation scores for the chatbot example when using

huggingface:

```
eval=0.19579241963569372 param={"context_length": 4, "max_tokens": 100, "model_preset": "vicuna-7b", "prompt_preset": "standard", "temperature": 0.3, "top_p": 1.0}
```

vllm:

```
eval=0.12566133684961223 param={"context_length": 4, "max_tokens": 100, "model_preset": "vicuna-7b", "prompt_preset": "standard", "temperature": 0.3, "top_p": 1.0}
```
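
To see where the gap comes from, it helps to look at per-example scores rather than the corpus average. Here is a minimal sketch of such a comparison using sacrebleu's chrF; the file names and the one-segment-per-line format are assumptions for illustration, not the chatbot example's actual output layout:

```python
# Hypothetical per-example chrF comparison between the two backends.
# Assumes three plain-text files with one segment per line; the actual
# chatbot example stores results differently, so adapt the loading code.
from sacrebleu.metrics import CHRF

chrf = CHRF()

def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

refs = load_lines("references.txt")            # gold responses (assumed file)
hf_hyps = load_lines("hyps_huggingface.txt")   # huggingface outputs (assumed file)
vllm_hyps = load_lines("hyps_vllm.txt")        # vllm outputs (assumed file)

# Rank examples by how much the two backends' sentence-level chrF differs,
# so the most divergent outputs can be inspected by hand.
gaps = []
for i, (ref, hf, vl) in enumerate(zip(refs, hf_hyps, vllm_hyps)):
    hf_score = chrf.sentence_score(hf, [ref]).score
    vl_score = chrf.sentence_score(vl, [ref]).score
    gaps.append((hf_score - vl_score, i, hf, vl))

for gap, i, hf, vl in sorted(gaps, reverse=True)[:10]:
    print(f"example {i}: chrF gap {gap:+.2f}")
    print(f"  hf:   {hf!r}")
    print(f"  vllm: {vl!r}")
```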

Commands to reproduce these results (after #161 is merged) are as follows:

```
python -u -m examples.chatbot.main --results-dir results_huggingface --models vicuna-7b --single-model vicuna-7b --prompts standard --single-prompt standard --experiments prompt --hf-inference-method huggingface
python -u -m examples.chatbot.main --results-dir results_vllm --models vicuna-7b --single-model vicuna-7b --prompts standard --single-prompt standard --experiments prompt --hf-inference-method vllm
```
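
Note that both runs sample at temperature 0.3, so some of the gap could be sampling noise rather than a backend bug. A minimal sketch of a more controlled check, generating greedily from both backends on the same prompt (the `lmsys/vicuna-7b-v1.3` checkpoint name and the prompt format are assumptions for the `vicuna-7b` preset):

```python
# Hypothetical side-by-side check: greedy decoding removes sampling noise,
# so any remaining output difference points at the backends themselves
# (tokenization, batching, numerics) rather than temperature-0.3 sampling.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "lmsys/vicuna-7b-v1.3"  # assumed checkpoint for the vicuna-7b preset
prompt = "USER: What is the capital of France?\nASSISTANT:"  # assumed format

# huggingface backend, greedy decoding
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=False, max_new_tokens=100)
hf_text = tokenizer.decode(
    out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# vllm backend, greedy decoding (temperature=0.0 disables sampling)
llm = LLM(model=MODEL)
vllm_text = llm.generate(
    [prompt], SamplingParams(temperature=0.0, max_tokens=100)
)[0].outputs[0].text

print("hf:  ", hf_text)
print("vllm:", vllm_text)
print("match:", hf_text.strip() == vllm_text.strip())
```

This loads the model twice in one process, so it needs enough GPU memory for two copies; running the two halves as separate scripts avoids that.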