xlang-ai / Spider2

Spider 2.0: Evaluating Language Models on Real-World Enterprise Text-to-SQL Workflows
https://spider2-sql.github.io
Apache License 2.0

CodeS baseline #7

Closed dannalily closed 1 month ago

dannalily commented 1 month ago

Hi, I have a question regarding how schema filtering is handled before passing the schema into CodeS.

I noticed that the CodeS baseline includes a schema filter. However, since the tokenizer used in the filter has a maximum length of 512, it seems challenging to input the entire database schema, especially with databases averaging around 1000 columns.

Could you explain how this is managed or clarify how the schema filter works? Many thanks for your help!

yuxiaooye commented 1 month ago

Thanks for your issue and sorry for the late reply!

Actually, CodeS already handles large schemas: it splits them into small batches of at most 500 tokens each (see here) before feeding them into the schema filter, so each batch stays under the tokenizer's 512-token limit.
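To illustrate the batching idea, here is a minimal sketch, not the actual CodeS code. The function name `batch_schema_items`, the `count_tokens` callback, and the whitespace-based counter are all hypothetical stand-ins; the real filter would count tokens with its own tokenizer, and summing per-item counts only approximates the length of the joined text.

```python
from typing import Callable, List


def batch_schema_items(
    items: List[str],
    count_tokens: Callable[[str], int],
    max_tokens: int = 500,
) -> List[List[str]]:
    """Greedily pack schema items into batches whose token total stays <= max_tokens.

    Hypothetical helper for illustration only. An item that is longer than
    max_tokens on its own still gets a batch to itself.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_len = 0
    for item in items:
        n = count_tokens(item)
        # Close the current batch if adding this item would exceed the budget.
        if current and current_len + n > max_tokens:
            batches.append(current)
            current, current_len = [], 0
        current.append(item)
        current_len += n
    if current:
        batches.append(current)
    return batches


def whitespace_token_count(text: str) -> int:
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


# A made-up schema with 1000 columns, two "tokens" per column description.
columns = [f"orders.col_{i} integer" for i in range(1000)]
batches = batch_schema_items(columns, whitespace_token_count, max_tokens=500)
```

With this toy counter, the 1000 columns end up in 4 batches of 250 columns each, and every batch can be scored by the filter independently.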

You may encounter a misleading warning, "Token indices sequence length is longer than the specified maximum sequence length for this model...", which can cause confusion. It is emitted only while the length of the current batch is being checked, so it is safe to ignore. This warning is now muted.
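If you run an older copy and still see the warning, one possible way to silence it yourself is sketched below. This is an assumption, not necessarily how the repository mutes it: it relies on the warning coming from the Hugging Face logger named `transformers.tokenization_utils_base`, and it uses only the standard `logging` module.

```python
import logging

# Assumed logger name: the Hugging Face tokenizer base module, which is where
# the "Token indices sequence length is longer than ..." warning is believed
# to originate. Raising its level to ERROR hides warnings from that module only.
hf_tokenizer_logger = logging.getLogger("transformers.tokenization_utils_base")
hf_tokenizer_logger.setLevel(logging.ERROR)
```

Scoping the change to this one logger leaves warnings from the rest of the library visible.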