The precipitating reason for this question is that I'm trying to force an eos token, but the model continues generating.
Here's a sample LogitsProcessor:
class LogitsProcessor {
  pipeline: TextGenerationPipeline;
  tokenIds: number[];
  stopTokenId: number;

  constructor(pipeline: TextGenerationPipeline, str: string) {
    this.pipeline = pipeline;
    const { input_ids } = (pipeline.tokenizer as TokenizeFn)(str);
    this.tokenIds = [...input_ids.data].map((n: bigint) => Number(n));
    this.stopTokenId = this.pipeline.tokenizer.model.convert_tokens_to_ids([
      this.pipeline.tokenizer.getToken('eos_token'),
    ])[0];
  }

  processors = [(inputTokens: number[], logits: Tensor) => {
    if (inputTokens.length > this.tokenIds.length) {
      console.warn('This should not happen');
      return logits;
    }
    // Force either the next token of the target string, or the eos token
    // once the whole string has been emitted.
    const id = inputTokens.length === this.tokenIds.length
      ? this.stopTokenId
      : this.tokenIds[inputTokens.length];
    logits.data.fill(-Infinity);
    logits.data[id] = Infinity;
    return logits;
  }];

  [Symbol.iterator]() {
    return this.processors.values();
  }
}
instantiated with:
const prompt = 'Write me some code';
const logitsProcessor = new LogitsProcessor(this.pipeline, prompt + ' foo');
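A sketch of how the processor might then be passed to generation; the third positional argument accepting a logits processor list is an assumption based on transformers.js v2's generate(inputs, generation_config, logits_processor) signature, and this class iterates like one via Symbol.iterator:
const { input_ids } = this.pipeline.tokenizer(prompt);
// Assumption: the third positional argument accepts an iterable of processors.
const outputTokenIds = await this.pipeline.model.generate(input_ids, { max_new_tokens: 20 }, logitsProcessor);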
This will generate output like the following:
Write me some code foo<|endoftext|>
Student: A company has a budget of $5000 to spend on advertising. They want
I don't understand why <|endoftext|> is being treated as part of the text output and not as an indication to stop generation. I assume the answer is that I'm not understanding something in my initial question above.
Hmm. I just tried with Xenova/gpt2 and I now see the following output:
Write me some code foo<|endoftext|>
So, maybe the issue with the model not stopping is specific to the model being used?
I still don't understand why <|endoftext|> is being returned as part of the text generated, though.
> I still don't understand why <|endoftext|> is being returned as part of the text generated, though.
This was user error. I'm calling .decode() manually and was neglecting to pass skip_special_tokens; passing that option successfully omits <|endoftext|> from the text:
const decoded = tokenizer.decode(outputTokenIds[0], {
skip_special_tokens: true,
});
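For contrast, a minimal sketch with hypothetical token IDs (a GPT-2-style vocabulary is assumed, where 50256 is <|endoftext|>):
// Hypothetical output ending in the eos token.
const outputTokenIds = [[10919, 318, 257, 50256]];
tokenizer.decode(outputTokenIds[0]); // eos rendered literally: "...<|endoftext|>"
tokenizer.decode(outputTokenIds[0], { skip_special_tokens: true }); // eos omitted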
I still don't understand why phi-1_5_dev is not stopping on an eos token, though. Here's what I've found for phi-1_5_dev:
- pipeline.model.config.eos_token_id is 2. I see this in config.json.
- pipeline.tokenizer.model.convert_tokens_to_ids([pipeline.tokenizer.getToken('eos_token'),])[0] is 50256. I see this in vocab.json.
- Returning 2 as the eos token successfully stops generation for phi-1_5_dev (while returning 50256 does not stop generation). However, the 2 token ID gets incorrectly decoded:
const decoded = tokenizer.decode(outputTokenIds[0], {
skip_special_tokens: true,
});
> "Write me some code foo#"
Which makes sense, as token ID 2 is marked as # in vocab.json.
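A minimal sketch that surfaces the discrepancy, using only the properties already shown above (the values are the ones observed for phi-1_5_dev):
const configEos = pipeline.model.config.eos_token_id; // 2, per config.json
const tokenizerEos = pipeline.tokenizer.model.convert_tokens_to_ids([
  pipeline.tokenizer.getToken('eos_token'),
])[0]; // 50256, per vocab.json
console.log(configEos, tokenizerEos, configEos === tokenizerEos); // 2 50256 false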
So I guess this entire thread boils down to: why the discrepancy? Does this indicate a bug in the model? Or am I misunderstanding how I should be decoding output tokens containing eos tokens?
As for your original question, you can use the NoBadWordsLogitsProcessor logits processor for this (see here). You can use it by setting bad_words_ids in the generation params object:
// Generate text
const result = await generator(prompt, {
max_new_tokens: 100,
bad_words_ids: [[123]], // list of list of token ids (2D since you can specify a sequence of tokens to skip)
});
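To ban words rather than raw IDs, here's a sketch of deriving bad_words_ids from strings (assumption: the tokenizer adds no special tokens to these inputs, which holds for GPT-2-style tokenizers):
const badWords = [' gazed'];
const bad_words_ids = badWords.map((word) => {
  // tokenizer(...) returns { input_ids } as in the snippets above.
  const { input_ids } = tokenizer(word);
  return [...input_ids.data].map((n) => Number(n));
});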
Thanks for the response! Those edits look great. Does eos_token_id: null imply anything in particular, or does it just mean it falls back to whatever the default eos_token_id is?
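Relatedly, a possible workaround for the phi-1_5_dev behavior above is to pass the config's eos id explicitly. A sketch, assuming the pipeline forwards eos_token_id through to the generation config:
const result = await generator(prompt, {
  max_new_tokens: 100,
  eos_token_id: generator.model.config.eos_token_id, // 2 for phi-1_5_dev
});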
> As for your original question, you can use the NoBadWordsLogitsProcessor logits processor for this (see here)
Appreciate that reference. My actual use case is a bit more complicated - I'm trying to implement a GBNF grammar parser similar to llama.cpp's implementation. But good to know this exists so I don't have to reinvent the wheel in the future!
Question
I'm trying to write a custom LogitsProcessor and have some questions. For reference, I'm using Xenova/phi-1_5_dev. I'm trying to implement custom logic for whitelisting or blacklisting tokens, but I'm running into difficulties understanding how to interpret token IDs, tokens, and their decoded counterparts. Here's what I think I understand:
- The vocabulary is defined in vocab.json, and has 50,257 entries.
- pipeline.tokenizer.vocab is translated from the object representation of vocab.json ({ token: tokenID }) to an array of tokens whose indices correspond to tokenID.
- vocab.json has 50,257 entries, but pipeline.tokenizer.vocab has 50,295 entries. Is this because pipeline.tokenizer.vocab also includes added_tokens.json? special_tokens_map.json appears to be already included in vocab.json.
- vocab.json at 50255 is "Ġgazed", but if I decode this character by character (pipeline.tokenizer.decoder.byte_decoder('Ġ') becomes 32, which corresponds to a space " ") I get " gazed". I think these correspond to code points.
- The logits argument contains scores where the index of each score is the tokenID. So setting the score at position 50255 to -Infinity should ensure that the token "Ġgazed" (or, decoded, " gazed") never appears (see the sketch below).
- The logits argument I'm getting back for this model in my LogitsProcessor has dimensions of [51200,], but pipeline.tokenizer.vocab has a size of 50,295. That would seem to indicate 905 unused tokens at the end of the tensor; can these be safely ignored, or do they correspond to something important that I'm missing?

I'd appreciate any insight or feedback on whether my assumptions above are correct or not. Thank you!
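A minimal sketch of the blacklisting idea from the list above, reusing the masking pattern from the LogitsProcessor at the top of this thread (the banned ID is the "Ġgazed" example, taken from vocab.json):
const BANNED_TOKEN_ID = 50255; // "Ġgazed", i.e. " gazed" once decoded

const blacklist = (inputTokens: number[], logits: Tensor) => {
  // A -Infinity logit means the token can never be sampled,
  // regardless of the score the model assigned it.
  logits.data[BANNED_TOKEN_ID] = -Infinity;
  return logits;
};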