xorbitsai / inference

Replace OpenAI GPT with another LLM in your app by changing a single line of code. Xinference gives you the freedom to use any LLM you need. With Xinference, you're empowered to run inference with any open-source language models, speech recognition models, and multimodal models, whether in the cloud, on-premises, or even on your laptop.
https://inference.readthedocs.io
Apache License 2.0
5.56k stars 458 forks source link

Respone repeated when inference gemma-2 #1818

Open HuuHuy227 opened 4 months ago

HuuHuy227 commented 4 months ago

Describe the bug

After lauched model , respone repeated until max tokens Example: when I ask 'hello' it respones 'HelloHowToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToToTo...' until the max token

qinxuye commented 4 months ago

Are you using Transformers engine?

HuuHuy227 commented 4 months ago

Are you using Transformers engine?

Yes

qinxuye commented 4 months ago

I can reproduce this, but I don't know what happened to Transformers engine, now other formats should work well.

HuuHuy227 commented 4 months ago

I can reproduce this, but I don't know what happened to Transformers engine, now other formats should work well.

Default generate() function of transformers worked well

github-actions[bot] commented 3 months ago

This issue is stale because it has been open for 7 days with no activity.