Closed · dannalily closed this 1 month ago
Thanks for your issue and sorry for the late reply!
Actually, CodeS already handles large schemas by splitting them into small batches of at most 500 tokens (see here) before feeding them into the schema filter.
You may have encountered a misleading warning: `Token indices sequence length is longer than the specified maximum sequence length for this model...`. It is raised only while measuring the length of the current batch, so it is safe to ignore. This warning is now muted.
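For anyone curious what "splitting into batches of at most 500 tokens" looks like in practice, here is a minimal sketch of a greedy token-budgeted split. This is an illustration under assumptions, not the actual CodeS implementation: the function name is hypothetical, and `tokenize` is a simple whitespace stand-in for the real subword tokenizer.

```python
def split_schema_into_batches(column_descriptions, tokenize, max_tokens=500):
    """Greedily pack column descriptions into batches whose total token
    count stays within max_tokens (an oversize single description would
    still get its own batch)."""
    batches, current, current_len = [], [], 0
    for desc in column_descriptions:
        n = len(tokenize(desc))
        # Flush the current batch if adding this description would exceed the budget.
        if current and current_len + n > max_tokens:
            batches.append(current)
            current, current_len = [], 0
        current.append(desc)
        current_len += n
    if current:
        batches.append(current)
    return batches

# Whitespace tokenizer as a stand-in for the real subword tokenizer.
tokenize = str.split
# Hypothetical schema with ~300 columns, 3 tokens per description.
cols = [f"table_{i}.column_{i} : text" for i in range(300)]
batches = split_schema_into_batches(cols, tokenize, max_tokens=500)
```

Each batch then fits comfortably under the 512-token limit of the filter's tokenizer, so the whole schema can be scored batch by batch.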
Hi, I have a question regarding how schema filtering is handled before passing the schema into CodeS.
I noticed that the CodeS baseline includes a schema filter. However, since the tokenizer used by the filter has a maximum length of 512 tokens, it seems infeasible to feed in the entire database schema, especially for databases averaging around 1,000 columns.
Could you explain how this is managed or clarify how the schema filter works? Many thanks for your help!