zmedelis / bosquet

Tooling to build LLM applications: prompt templating and composition, agents, LLM memory, and other instruments for builders of AI applications.
https://zmedelis.github.io/bosquet/
Eclipse Public License 1.0

applying LLMs on long documents #2

Closed. behrica closed this issue 1 year ago.

behrica commented 1 year ago

I was looking with interest at your library and I was wondering if "applying LLMs on large documents" is on your feature list?

I have done this once ad hoc, so I mean:

zmedelis commented 1 year ago

Yes, this is definitely on the list. Large docs are part of almost any realistic use case, so this is a must-have feature.

behrica commented 1 year ago

I could be interested in contributing some basic functionality. For example, we needed a function that splits arbitrary text into "pieces" which "just fit" into a prompt, i.e. each piece contains at most n "GPT-3 tokens" (with n configurable).
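
For illustration, a call to such a splitting helper might look like the sketch below (the `split-max-tokens` name, its arguments, and the `count-gpt3-tokens` counting function are hypothetical, just to make the requirement concrete):

```clojure
;; Hypothetical call shape: split a long document into pieces, each holding
;; at most 2000 GPT-3 tokens according to the supplied counting function.
(split-max-tokens long-document-text 2000 count-gpt3-tokens)
;; => ["first piece ..." "second piece ..." ...]
```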

behrica commented 1 year ago

I researched a bit, and the only GPT-3 tokenizer available for the JVM seems to be this one:

https://github.com/eisber/tiktoken

It uses JNI.

behrica commented 1 year ago

I came up with a generic function that splits a given String "optimally" for an LLM, meaning it splits the string into pieces of maximal size without going over the token limit.

The function requires passing a "token-count function" that counts the tokens of any given input text.

A perfect implementation of this function would use the same tokenizer as the model itself, which might not be possible today in Clojure / Java. As far as I found, there is no pure-JVM GPT-3 tokenizer; there are ones for Python and JavaScript, and the Java+JNI one above, all requiring more complex setups.

But a user could do this, or decide to implement their own "not optimal, pessimistic" token counter based on a heuristic.
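
A minimal sketch of the idea in Clojure (the names, the greedy word-by-word strategy, and the rough 4-characters-per-token heuristic here are only illustrative, not the actual PR code):

```clojure
(ns example.splitter
  (:require [clojure.string :as str]))

(defn heuristic-token-count
  "Pessimistic token estimate: assumes roughly 4 characters per GPT-3 token
  and rounds up. Not exact, but cheap and dependency-free."
  [text]
  (int (Math/ceil (/ (count text) 4.0))))

(defn split-max-tokens
  "Splits `text` on whitespace into pieces whose token count, as reported by
  `token-count-fn`, stays at or below `max-tokens`. Greedy: words are appended
  to the current piece until the next word would push it over the limit.
  A single word longer than the limit still becomes its own piece."
  [text max-tokens token-count-fn]
  (let [words (str/split text #"\s+")]
    (->> words
         (reduce (fn [pieces word]
                   (let [current   (peek pieces)
                         candidate (if (str/blank? current)
                                     word
                                     (str current " " word))]
                     (if (<= (token-count-fn candidate) max-tokens)
                       (conj (pop pieces) candidate)
                       (conj pieces word))))
                 [""])
         (remove str/blank?)
         (vec))))

;; Usage: split a long document into pieces of at most 2000 tokens
;; (split-max-tokens long-document 2000 heuristic-token-count)
```

The token counter stays pluggable, so the heuristic can later be swapped for a real GPT-3 tokenizer (e.g. the JNI-based one above) without changing the splitting logic.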

Would you be interested in a PR, starting with this function?

zmedelis commented 1 year ago

Thanks for the PR, reviewing it now and should merge shortly.

zmedelis commented 1 year ago

PR merged