We're seeing different accuracies depending on whether we use the huggingface or vllm inference backend. Here are chrF evaluation scores for the chatbot example when using
huggingface:
eval=0.19579241963569372 param={"context_length": 4, "max_tokens": 100, "model_preset": "vicuna-7b", "prompt_preset": "standard", "temperature": 0.3, "top_p": 1.0}
vllm:
eval=0.12566133684961223 param={"context_length": 4, "max_tokens": 100, "model_preset": "vicuna-7b", "prompt_preset": "standard", "temperature": 0.3, "top_p": 1.0}
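For reference, here is a rough sketch of how these generation parameters typically map onto each backend. This is an assumed illustration, not the actual zeno-build internals, and the checkpoint name "lmsys/vicuna-7b-v1.3" is a guess at what the "vicuna-7b" preset resolves to:

```python
# Hedged sketch: how temperature/top_p/max_tokens are usually passed to each backend.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

MODEL = "lmsys/vicuna-7b-v1.3"  # assumed checkpoint for the "vicuna-7b" preset
prompt = "..."  # prompt built from the last `context_length` turns

# huggingface backend: sampled generation via model.generate
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
inputs = tok(prompt, return_tensors="pt")
out = model.generate(
    **inputs, do_sample=True, temperature=0.3, top_p=1.0, max_new_tokens=100
)
hf_text = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# vllm backend: same parameters through SamplingParams
llm = LLM(model=MODEL)
vllm_text = llm.generate(
    [prompt], SamplingParams(temperature=0.3, top_p=1.0, max_tokens=100)
)[0].outputs[0].text
```

Note that with temperature=0.3 both backends sample stochastically, so some run-to-run variation within a single backend is expected as well.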
Commands to reproduce the result (after #161 is merged) are as follows:
python -u -m examples.chatbot.main --results-dir results_huggingface --models vicuna-7b --single-model vicuna-7b --prompts standard --single-prompt standard --experiments prompt --hf-inference-method huggingface
python -u -m examples.chatbot.main --results-dir results_vllm --models vicuna-7b --single-model vicuna-7b --prompts standard --single-prompt standard --experiments prompt --hf-inference-method vllm
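If it helps with triage, the gap can also be re-checked offline by re-scoring the saved outputs with sacrebleu's chrF. This is only a sketch under assumed file names (outputs.json / references.json are hypothetical, not the actual files the example writes):

```python
# Hedged sketch: re-score both runs with chrF, assuming each results dir holds
# a JSON list of output strings and the references live in a shared JSON file.
import json
from sacrebleu.metrics import CHRF

def score(outputs_path: str, references_path: str) -> float:
    with open(outputs_path) as f:
        hypotheses = json.load(f)  # list[str] of model outputs
    with open(references_path) as f:
        references = json.load(f)  # list[str] of gold responses
    chrf = CHRF()
    # sacrebleu expects a list of reference streams, hence the extra nesting;
    # divide by 100 to match the 0-1 scale of the eval numbers above.
    return chrf.corpus_score(hypotheses, [references]).score / 100.0

print("huggingface:", score("results_huggingface/outputs.json", "data/references.json"))
print("vllm:", score("results_vllm/outputs.json", "data/references.json"))
```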