zytedata / html-text

MIT License

Keep minimal structure of tables in text #13

Open ivsanro1 opened 3 weeks ago

ivsanro1 commented 3 weeks ago

I think it'd be great to keep some basic separators so that we don't lose too much structural info from tables:

>>> import html_text
>>> from lxml.html import fromstring

>>> tree = fromstring("""
... <table>
...   <tr>
...     <th>Company</th>
...     <th>Contact</th>
...     <th>Country</th>
...   </tr>
...   <tr>
...     <td>Alfreds Futterkiste</td>
...     <td>Maria Anders</td>
...     <td>Germany</td>
...   </tr>
...   <tr>
...     <td>Centro comercial Moctezuma</td>
...     <td>Francisco Chang</td>
...     <td>Mexico</td>
...   </tr>
... </table> 
... """)

>>> print(html_text.extract_text(tree, guess_layout=True))
Company Contact Country
Alfreds Futterkiste Maria Anders Germany
Centro comercial Moctezuma Francisco Chang Mexico

A better output would be:

Company | Contact | Country
Alfreds Futterkiste | Maria Anders | Germany
Centro comercial Moctezuma | Francisco Chang | Mexico
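
For illustration, here is a minimal sketch of how output like this could be produced. It is not how html_text works internally: it uses the standard library's `html.parser` rather than lxml to stay dependency-free, the names `TableRowCollector` and `table_to_text` are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableRowCollector(HTMLParser):
    """Collect the text of each <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, or None
        self._cell = None  # text chunks of the cell being parsed, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Collapse runs of whitespace inside the cell to single spaces.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def table_to_text(html):
    """Render each table row on its own line, cells joined by ' | '."""
    parser = TableRowCollector()
    parser.feed(html)
    parser.close()
    return "\n".join(" | ".join(row) for row in parser.rows)


html = """
<table>
  <tr><th>Company</th><th>Contact</th><th>Country</th></tr>
  <tr><td>Alfreds Futterkiste</td><td>Maria Anders</td><td>Germany</td></tr>
  <tr><td>Centro comercial Moctezuma</td><td>Francisco Chang</td><td>Mexico</td></tr>
</table>
"""
print(table_to_text(html))
```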

@lopuhin do you think this would be relevant for this library?

lopuhin commented 3 weeks ago

@ivsanro1 that makes a lot of sense. Thinking about other options here, one more possibility could be using tabs \t instead of | as a separator. That would keep to the approach of not adding new non-blank characters to the original text while preserving the same amount of information as |, and it matches how tables are represented when you copy them and paste them into a text field.
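
As a small illustration of tabs carrying the same information as |, a tab-separated rendering can be split back into cells with the standard `csv` module. The sample text below is written by hand as an assumption about what such output could look like. It is not actual html_text output.

```python
import csv
import io

# Hand-written tab-separated rendering of the table from the issue
# (an assumed output format, not something html_text produces today).
text = (
    "Company\tContact\tCountry\n"
    "Alfreds Futterkiste\tMaria Anders\tGermany\n"
    "Centro comercial Moctezuma\tFrancisco Chang\tMexico\n"
)

# Each line becomes one row; each tab marks a cell boundary.
rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
for row in rows:
    print(row)
```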

ivsanro1 commented 3 weeks ago

> Thinking about other options here, one more possibility could be using tabs \t instead of | as a separator

Makes sense @lopuhin, thanks for your input on this. Originally I was thinking of | rather than tabs because the vocabularies of recent LLMs (e.g. llama3) tend to include merged tokens for combinations of spaces and tabs, which makes the resulting token sequences less consistent, especially when the table has cells without text. I was wondering whether that would affect how an LLM interprets this text, semantically speaking.

I find that the | separator tokenizes more consistently:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
>>> tokenizer.encode("\t", add_special_tokens=False)
[197]
>>> tokenizer.encode("\t\t", add_special_tokens=False)
[298]
>>> tokenizer.encode("\t\t\t", add_special_tokens=False)
[573]
>>> tokenizer.encode(" | ", add_special_tokens=False)
[765, 220]
>>> tokenizer.encode(" |  | ", add_special_tokens=False)
[765, 220, 765, 220]
>>> tokenizer.encode("| ", add_special_tokens=False)
[91, 220]
>>> tokenizer.encode("|  |", add_special_tokens=False)
[91, 220, 765]
>>> tokenizer.encode(" \t  \t ", add_special_tokens=False)
[7163, 79199]
>>> tokenizer.encode(" \t  \t  \t", add_special_tokens=False)
[7163, 256, 63472]
>>> tokenizer.encode(" \t  \t  \t ", add_special_tokens=False)
[7163, 256, 8860, 3762]
>>> tokenizer.encode(" |  |  |", add_special_tokens=False)
[765, 220, 765, 220, 765]
>>> tokenizer.encode(" |  |  | ", add_special_tokens=False)
[765, 220, 765, 220, 765, 220]

But I also like the option of not adding non-whitespace characters. I think the best option would be to make the separator customizable.
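
A rough sketch of what a customizable separator could look like, assuming the cell texts have already been extracted row by row. The function `join_rows` and the option name `cell_separator` are hypothetical and not part of html_text.

```python
def join_rows(rows, cell_separator=" | "):
    """Join extracted cell texts, one line per row.

    `cell_separator` is a hypothetical option name used for illustration.
    """
    return "\n".join(cell_separator.join(row) for row in rows)


rows = [
    ["Company", "Contact", "Country"],
    ["Alfreds Futterkiste", "", "Germany"],  # empty Contact cell
]

piped = join_rows(rows)                         # default: " | "
tabbed = join_rows(rows, cell_separator="\t")   # whitespace-only variant

print(repr(piped))
print(repr(tabbed))
```

With an empty cell, the two variants produce the " |  | " and "\t\t" patterns whose tokenization is compared in the session above.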