Closed behrica closed 1 year ago
Yes, this is definitely on the list. Large docs are part of any realistic use case, and thus a must-have feature.
I could be interested in contributing some basic functionality. For example, we needed a function which splits arbitrary text into "pieces" which "just fit" into a prompt, i.e. contain at most n "GPT-3 tokens" (with n configurable).
I researched a bit, and the only GPT-3 tokeniser available for the JVM seems to be this one, which uses JNI:
https://github.com/eisber/tiktoken
I came up with a generic function which splits a given string "optimally" for an LLM, meaning it splits the string into pieces of maximal size while never exceeding the token limit.
The function requires passing in a "token-count function" which counts the tokens of any given input text.
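The idea can be sketched roughly like this (Python used here only for illustration; the function name and the greedy word-based packing strategy are my assumptions, not the actual implementation):

```python
def split_for_llm(text, token_count, max_tokens):
    """Greedily pack words into pieces whose token count stays <= max_tokens.

    token_count: a function mapping a string to its token count.
    Assumes every single word fits within max_tokens on its own.
    """
    pieces = []
    current = ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if token_count(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces
```

For example, with a trivial word-based counter, `split_for_llm("a b c d e", lambda s: len(s.split()), 2)` yields `["a b", "c d", "e"]`: each piece is as large as possible without going over the limit.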
A perfect implementation of this function would use the same tokenizer as the model itself, which might not be possible today in Clojure / Java. There is no pure-JVM GPT-3 tokenizer as far as I could find; there are ones for Python and JavaScript, plus the Java+JNI one above, all requiring more complex setups.
But a user could do that, or decide to implement their own non-optimal, pessimistic token counter based on a heuristic.
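Such a heuristic counter could look like this (again a Python sketch; the ~4-characters-per-token figure is OpenAI's commonly cited rule of thumb for English, and dividing by 3 instead of 4 is my choice to make the estimate deliberately pessimistic):

```python
import math

def heuristic_token_count(text):
    # Rule of thumb: one English token is roughly 4 characters.
    # Dividing by 3 overestimates the count, so pieces built with
    # this counter stay safely under the model's real token limit.
    return math.ceil(len(text) / 3)
```

Overestimating wastes a little prompt capacity but guarantees the pieces never exceed the model's limit, which is the safe trade-off when the exact tokenizer is unavailable.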
Would you be interested in a PR starting with this function?
Thanks for the PR; reviewing it now and should merge shortly.
PR merged
I was looking with interest at your library and I was wondering if "applying LLMs on large documents" is on your feature list?
I have done this once before, ad hoc, so what I mean is: