I've encountered an issue with the following code in certain models:
```python
for i in range(max_new_tokens):
    if i == 0:  # prefill: run the full prompt through the model
        if model.config.is_encoder_decoder:
            out = model.decoder(
                input_ids=start_ids,
                encoder_hidden_states=encoder_output,
                use_cache=True,
            )
            logits = model.lm_head(out[0])
        else:
            out = model(torch.as_tensor([input_ids], device=device), use_cache=True)
            logits = out.logits
        past_key_values = out.past_key_values
    else:  # decode: feed only the newest token, reusing the KV cache
        if model.config.is_encoder_decoder:
            out = model.decoder(
                input_ids=torch.as_tensor(
                    [[token] if not sent_interrupt else output_ids], device=device
                ),
                encoder_hidden_states=encoder_output,
                use_cache=True,
                past_key_values=past_key_values if not sent_interrupt else None,
            )
            sent_interrupt = False
            logits = model.lm_head(out[0])
        else:
            out = model(
                input_ids=torch.as_tensor(
                    [[token] if not sent_interrupt else output_ids], device=device
                ),
                use_cache=True,
                past_key_values=past_key_values if not sent_interrupt else None,
            )
            sent_interrupt = False
            logits = out.logits
        past_key_values = out.past_key_values
```
The error message I'm receiving is:
```
[address=0.0.0.0:43813, pid=2037] Attention mask should be of size (1, 1, 1, 20), but is torch.Size([1, 1, 1, 1])
```
This appears to be an attention mask size mismatch. I'd appreciate any insights into what might be causing this error and how to resolve it.
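One likely cause: during the decode step the model builds its attention mask from the 1-token `input_ids` alone, while the KV cache already holds the earlier positions, so the mask ends up `(1, 1, 1, 1)` instead of covering all 20 positions. A minimal sketch of a workaround, assuming the model accepts an explicit `attention_mask` during cached decoding (the layer count and head dimensions below are made-up placeholders, not taken from any specific model):

```python
import torch

# Hypothetical stand-in for a real KV cache: 2 layers of
# (key, value) tensors shaped (batch, heads, seq_len, head_dim).
past_len = 19
past_key_values = tuple(
    (torch.zeros(1, 12, past_len, 64), torch.zeros(1, 12, past_len, 64))
    for _ in range(2)
)

# The mask must cover the cached positions plus the single new token,
# i.e. shape (1, past_len + 1) = (1, 20) here.
attention_mask = torch.ones(
    (1, past_key_values[0][0].shape[2] + 1), dtype=torch.long
)

# The decode step would then pass the mask explicitly, e.g.:
# out = model(
#     input_ids=torch.as_tensor([[token]], device=device),
#     attention_mask=attention_mask,
#     use_cache=True,
#     past_key_values=past_key_values,
# )
```

Whether this is sufficient depends on the architecture; some models infer the full mask from the cache length on their own, which would explain why only certain models fail.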
Additionally, I have a question about the input choice in the code. I noticed that for subsequent iterations (when `i > 0`), the code uses only the last generated token as input:

```python
input_ids=torch.as_tensor([[token] if not sent_interrupt else output_ids], device=device)
```
Why is only the last generated token used as input rather than the complete `output_ids`? Is there a specific reason for this design choice, and what are the implications for the model's performance and output?
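For what it's worth, my understanding is that this is the standard KV-cache optimization: once the keys and values for earlier positions are cached in `past_key_values`, the output for the newest position is identical whether you feed the whole sequence or just the last token. A toy demonstration with plain (single-head, unbatched) causal attention, independent of any real model:

```python
import torch

torch.manual_seed(0)
T, d = 5, 8  # sequence length, head dimension
q, k, v = (torch.randn(T, d) for _ in range(3))

# Full forward pass: causal self-attention over all T positions.
scores = (q @ k.T) / d**0.5
causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
full_out = torch.softmax(scores.masked_fill(causal, float("-inf")), dim=-1) @ v

# Incremental step: only the newest query attends to the cached k/v.
# No causal mask needed: the last position may attend to everything before it.
inc_out = torch.softmax((q[-1:] @ k.T) / d**0.5, dim=-1) @ v

assert torch.allclose(full_out[-1], inc_out[0], atol=1e-6)
```

So feeding only the last token is intentional and changes nothing about the output; it just avoids recomputing attention for positions already in the cache. The `output_ids` path is taken only after `sent_interrupt`, when the cache has been discarded and the sequence must be re-prefetched in full.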
Thank you for your time and assistance.
Expected behavior
Clarification on whether using the last token as input is intentional and optimal, or if using the complete output_ids would be more appropriate.
System Info
CUDA 12.4
Running Xinference with Docker? Yes (see the command below).
Version info
v0.13.0
The command used to start Xinference:

```shell
docker run --gpus "device=0" -p 9998:9997 -itd --restart unless-stopped \
  -v `pwd`:`pwd` \
  --name xinference-013 \
  registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v0.13.0 \
  xinference-local --host 0.0.0.0 --port 9997
```

Reproduction