zmedelis / bosquet

Tooling to build LLM applications: prompt templating and composition, agents, LLM memory, and other instruments for builders of AI applications.
https://zmedelis.github.io/bosquet/
Eclipse Public License 1.0
280 stars, 19 forks

how to interop? #22

usametov closed 1 year ago

usametov commented 1 year ago

I am thinking about incorporating existing tools from the langchain ecosystem, which is growing very fast. They have mature products, like AutoGPT and BabyAGI. It would be nice if we could extend or reuse some of that in bosquet.

For instance, we certainly need Google search. Currently I am using babashka to call a Python google-search client.

One option would be to use Python interop via libpython-clj. Also, we can integrate vector DB APIs, like activeloop, because they have a serverless REST API. That way I could keep my babashka script, which populates deeplake, and then use bosquet for the rest of my needs. I know that Vald Vector DB has a Clojure client, but, unfortunately, langchain tools do not integrate with Vald. I hope they will add it in the near future. In the meantime, we could just build an AWS Lambda Clojure Vald client. Calling AWS Lambda from langchain is easy.

behrica commented 1 year ago

I thought a lot about this, but did not come to a conclusion on whether anything needs to be added to bosquet itself. Clojure is very different regarding "everything needs to be integrated", as Clojure is "data first". I happily integrated (ad hoc) a vector database with bosquet and just passed the results of the API call into gen/template. It worked very well and I did not see the need to add anything to bosquet.

Regarding vector databases: maybe a Clojure abstraction over vector databases would be useful, but in the spirit of Clojure this would go into a different project, not bosquet (see also #10, #6).

Regarding Python "integration": I have used Clojure together with Python a lot. libpython-clj works well, in general.

In my view it hardly ever pays off to "wrap" a Python library, unless you do it "ad hoc" for exactly your problem. So I do not see a need to add anything to bosquet for this either. You can already use any Python library rather comfortably from Clojure with libpython-clj.

Regarding usage of babashka: this is a completely different story. Are you aware that babashka is not Clojure? It supports a big (but not 100%) subset of the Clojure language, and a lot of Clojure libraries will not work with it. So maybe you would like bosquet to be or become "babashka compatible"? I think libpython-clj does not work with babashka, as it relies on the JVM native interface, which (I think) does not work in babashka.

usametov commented 1 year ago

Yes, data interop is what I am doing now, and I think we can expand on this. I am just thinking of other ideas, like using ElasticSearch as a vector store. For instance, we could extend the Elastisch lib to make it work with vector data and then integrate it with Bosquet.

zmedelis commented 1 year ago

Agree with @behrica regarding Python interop. Going down that path, the whole LLM layer could be done with Langchain (or whatnot), with libpython-clj used just to pass data around. On the other hand, the Python ecosystem is so rich in LLM tools of all sorts that introducing some kind of unifying LLM<->Python<->Bosquet interoperability might simply end up as unnecessary complexity.

zmedelis commented 1 year ago

Regarding vector databases:

I would think about it in more abstract terms: adding memory and abstractions for handling context that exceeds the token-count limit in Bosquet. Underneath, they might or might not be implemented using vector DBs.

usametov commented 1 year ago

Here is some low-hanging fruit: cosine similarity, using tech.ml.dataset: https://gist.github.com/usametov/509f3466e141db6f125b815c5ae12c75

This could work as a simple in-memory vector DB. We could also use datalog databases for storing vectors. We would just run two queries: a datalog query and a cosine-similarity query. I know this is not fancy, but it is simple and cheap.

Also, many LLMs can be used to generate triples, which, again, could be inserted into a datalog database. This could be used to keep the search space for vector queries within predefined limits. Think of it as a self-optimizing database :)
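To make the "simple in-memory vector db" idea concrete, here is a minimal sketch in plain Python (standard library only, not the tech.ml.dataset code from the gist). The names `cosine_similarity` and `top_k`, and the dict-of-vectors store, are assumptions made for illustration; nothing here is part of bosquet.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          store: Dict[str, Sequence[float]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Linear scan over the store; returns the k best (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query, vec))
              for doc_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A linear scan like this is only reasonable for small collections, which is where a prior filter (for example the datalog query mentioned above) would help by shrinking the candidate set first.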

behrica commented 1 year ago

I fully agree that the cosine distance of text embeddings is one potential basis for text similarity calculations. But what concretely do you want to add "here", in bosquet?

A vector database sits one level of abstraction higher. It answers the request:

give me x texts similar to this text (and performs the vector operations internally)

And yes, this is in some sense part of the "memory" concept discussed for LLMs.

But I cannot see what code to add to bosquet that would help with using these things without restricting the user to a single technology.

usametov commented 1 year ago

Sorry, I just meant adding some utility code to make it more usable. Also, if you want to design an abstraction layer for vector DBs, why not start with an in-memory DB? You can call it a reference implementation, and it does not have to be part of bosquet. You will end up writing adapters for each kind of DB anyway.
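As a rough illustration of "abstraction layer plus in-memory reference implementation", here is a hedged Python sketch. The `VectorStore` interface and the `InMemoryVectorStore` class are hypothetical names invented for this example, not an existing API, and embeddings are assumed to be computed elsewhere.

```python
import math
from typing import List, Protocol, Sequence, Tuple


class VectorStore(Protocol):
    """Hypothetical interface; each real database would get its own adapter."""

    def add(self, text: str, embedding: Sequence[float]) -> None: ...

    def similar(self, embedding: Sequence[float], k: int) -> List[str]: ...


def _normalize(v: Sequence[float]) -> Tuple[float, ...]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(x / norm for x in v)


class InMemoryVectorStore:
    """Reference implementation: linear scan over unit-length vectors."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Tuple[float, ...]]] = []

    def add(self, text: str, embedding: Sequence[float]) -> None:
        self._rows.append((text, _normalize(embedding)))

    def similar(self, embedding: Sequence[float], k: int) -> List[str]:
        q = _normalize(embedding)
        # For unit vectors the dot product equals the cosine similarity.
        ranked = sorted(
            self._rows,
            key=lambda row: sum(a * b for a, b in zip(q, row[1])),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]
```

Callers would depend only on `add` and `similar`, so an adapter for a hosted vector database could be swapped in without touching the prompt code.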

behrica commented 1 year ago

"More usable" is of course a good goal, but it needs to be weighed against the effort.

I was looking at langchain in Python and its large number of data loaders. It would be cool to have those in Clojure, but is it really worth "re-writing" them when libpython-clj makes interop this easy:

Require python modules

(ns try 
  (:require [libpython-clj2.python :as py]
            [libpython-clj2.require :refer [require-python]]))

;; Start the embedded Python interpreter before requiring any Python modules.
(py/initialize!)
(require-python '[builtins :as bt])
(py/from-import langchain.document_loaders OpenCityDataLoader)

use python

(def dataset "tmnf-yvry")
(def loader (OpenCityDataLoader  
                                 :city_id "data.sfgov.org"
                                 :dataset_id  dataset
                                 :limit 2000))

(def docs (py/py. loader load))
(-> docs  first
    (py/py.- page_content)
    (bt/eval  {})
    (py/->jvm))

It is about as long as the corresponding Python code:

from langchain.document_loaders import OpenCityDataLoader

dataset = "tmnf-yvry"  # crime data
loader = OpenCityDataLoader(city_id="data.sfgov.org",
                            dataset_id=dataset,
                            limit=2000)
docs = loader.load()
eval(docs[0].page_content)

What kind of utility code would make this even easier? I cannot think of any.

behrica commented 1 year ago

But maybe I have a rather extreme opinion on this. I do not think that "Clojure wrappers" for Java libraries are worth the effort either.

zmedelis commented 1 year ago

I do not see the need to add Python interop with Langchain or other LLM Python libs. If there were some specialized, unique, hard-to-reimplement small Python lib out there whose integration would bring immense value, then why not. But that is not the case right now. If it becomes the case, let's open concrete issues to deal with it.

@behrica thanks for moving the vector DB discussion to the other issue, where we can continue.

@usametov I am closing this. As noted above I am open to discussions on very concrete Python-based functionality/lib to integrate.