Open ivsanro1 opened 3 weeks ago
@ivsanro1 that makes a lot of sense. Thinking about other options here, one more possibility could be using tabs \t instead of | as a separator. That would still follow the approach that we don't add new non-blank characters to the original text, but at the same time it preserves the same amount of info as the |, and this is how tables are represented if you copy them and paste them into a text field.
Thinking about other options here, one more possibility could be using tabs \t instead of | as a separator
Makes sense, @lopuhin, thanks for your input on this. Originally I was thinking of | rather than tabs because the vocabularies of recent LLMs (e.g. llama3) tend to include combinations of spaces and tabs as single tokens, which makes the resulting tokens less consistent, especially when the table has cells without text. I was wondering whether that would affect how an LLM interprets this text, semantically speaking.
I find | separators more consistent in tokenization:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.encode("\t", add_special_tokens=False)
[197]
>>> tokenizer.encode("\t\t", add_special_tokens=False)
[298]
>>> tokenizer.encode("\t\t\t", add_special_tokens=False)
[573]
>>> tokenizer.encode(" | ", add_special_tokens=False)
[765, 220]
>>> tokenizer.encode(" | | ", add_special_tokens=False)
[765, 220, 765, 220]
>>> tokenizer.encode("| ", add_special_tokens=False)
[91, 220]
>>> tokenizer.encode("| |", add_special_tokens=False)
[91, 220, 765]
>>> tokenizer.encode(" \t \t ", add_special_tokens=False)
[7163, 79199]
>>> tokenizer.encode(" \t \t \t", add_special_tokens=False)
[7163, 256, 63472]
>>> tokenizer.encode(" \t \t \t ", add_special_tokens=False)
[7163, 256, 8860, 3762]
>>> tokenizer.encode(" | | |", add_special_tokens=False)
[765, 220, 765, 220, 765]
>>> tokenizer.encode(" | | | ", add_special_tokens=False)
[765, 220, 765, 220, 765, 220]
But I also like the option of not adding non-spacing chars. I think the best option would be to make it customizable.
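A minimal sketch of what "customizable" could look like, assuming the table has already been extracted into rows of cell strings. `render_table`, `cell_sep`, and `row_sep` are hypothetical names for illustration, not part of this library's API.

```python
from typing import Iterable, Optional, Sequence


def render_table(
    rows: Iterable[Sequence[Optional[str]]],
    cell_sep: str = "\t",
    row_sep: str = "\n",
) -> str:
    """Join cells with `cell_sep` and rows with `row_sep`.

    Empty cells (None or "") are kept as empty strings, so the number of
    separators per row still reflects the number of columns.
    """
    lines = []
    for row in rows:
        cells = ["" if cell is None else str(cell).strip() for cell in row]
        lines.append(cell_sep.join(cells))
    return row_sep.join(lines)


rows = [["Name", "Qty"], ["apple", "3"], ["pear", None]]

# Default: tabs, which adds no non-blank characters to the text.
tab_text = render_table(rows)

# Alternative: a visible pipe separator.
pipe_text = render_table(rows, cell_sep=" | ")
```

With a default of `"\t"`, no non-blank characters are added unless the caller asks for them, and anyone who prefers `|` can opt in.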
I think it'd be great to keep some basic separators so we don't lose too much structural info from tables:
While some better output would be:
@lopuhin do you think this would be relevant for this library?