Consider supporting FOSS LLMs

yuiseki commented 11 months ago

Currently, TRIDENT relies solely on the OpenAI API for its LLM. With the OpenAI API, it is difficult to freely customize or control the base model. Supporting FOSS LLMs should make it possible to develop TRIDENT as a fair and transparent FOSS AI assistant.

kshitijrajsharma commented 11 months ago

Hi @yuiseki, lovely that you have already opened an issue on this. Do you have any open source LLMs in mind that you are looking to research?

yuiseki commented 11 months ago

@kshitijrajsharma Thanks for your comment! I have already researched several OSS LLMs, and I share an overview of them below.

https://github.com/ggerganov/llama.cpp
- I would like to run TRIDENT on a Raspberry Pi without Internet connectivity as an ultimate goal
- If TRIDENT can be used in situations where power and communications are limited, it will help a lot of people.
- I'm most interested in llama.cpp as a viable option for getting some LLMs to work on the Raspberry Pi
- llama.cpp can even run LLM inference in parallel across multiple Raspberry Pis!
- https://github.com/ggerganov/llama.cpp/issues/2164
- MIT LICENSE
StableLM
- https://github.com/Stability-AI/StableLM
- Models tuned for conversation are licensed for non-commercial use, CC BY-NC-SA-4.0
- It's a different architecture than LLaMA, so it's going to be difficult to get it to work on a Raspberry Pi using llama.cpp
OpenLLaMA
- https://github.com/openlm-research/open_llama
- A reproduction of the LLaMA architecture, trained on open data that permits commercial use
- Completely OSS and available for commercial use
- I was able to confirm that it works with llama.cpp.
- Apache-2.0 license
LLaMA 2
- https://ai.meta.com/llama/
- Better performance than LLaMA, plus trained on commercially available open data
- I was able to confirm that it works with llama.cpp.
- However, the Open Source Initiative has issued a statement that LLaMA 2 is not open source
- https://blog.opensource.org/metas-llama-2-license-is-not-open-source/

In summary, I have started my trial and error with most attention to llama.cpp and OpenLLaMA, but I will continue to keep an eye on LLaMA 2.

kshitijrajsharma commented 11 months ago

Lovely! Are you looking for help with any of these? I can check out a few. Can you lay out what I should check and test?

kshitijrajsharma commented 11 months ago

I got a little hands-on experience with LLaMA model training:

Here is a sample dataset that can be used to train LLaMA with RLHF:

https://huggingface.co/datasets/HuggingFaceH4/stack-exchange-preferences

Now we need a similar training dataset prepared for Overpass questions and queries! There will be some challenges: we need to find training data, and retraining or experimenting with the model needs a massive GPU and machine. I tried a demo on Colab but couldn't get through it on the free version.
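A minimal sketch of what one record in such a dataset could look like, assuming a simple JSON Lines layout with one preference pair per row. The field names (`question`, `chosen`, `rejected`) and the example Overpass queries are illustrative assumptions, not the schema of the Stack Exchange dataset linked above.

```python
import json


def make_record(question: str, chosen: str, rejected: str) -> str:
    """Serialize one preference pair as a single JSON Lines row."""
    record = {
        "question": question,  # natural language request from a user
        "chosen": chosen,      # Overpass query judged to be the better answer
        "rejected": rejected,  # Overpass query judged to be the worse answer
    }
    return json.dumps(record, ensure_ascii=False)


# Hypothetical example pair; the queries are for illustration only.
line = make_record(
    "Hospitals in Kathmandu",
    '[out:json];area[name="Kathmandu"]->.a;nwr[amenity=hospital](area.a);out center;',
    "[out:json];node[amenity=hospital];out;",
)
parsed = json.loads(line)
```

Keeping one JSON object per line makes it easy to append new pairs as they are collected and to load the file with common dataset tooling later.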

Two references:
- https://lightning.ai/pages/community/tutorial/accelerating-llama-with-fabric-a-comprehensive-guide-to-training-and-fine-tuning-llama/
- https://huggingface.co/blog/stackllama

It looks like it can run on 8GB of GPU memory, which is good, since a standard personal computer nowadays has this. But a solid GPU and a training dataset are still needed. The training dataset is something we can generate by asking the community and bootstrapping from Overpass query examples. The challenge is the machine.
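As a rough check on the 8GB figure, the memory needed just to hold model weights is the parameter count times the bytes per parameter. The sketch below assumes a 7-billion-parameter model (an illustrative size, not stated above) and ignores activations, gradients, and optimizer state, which all add more on top.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Memory needed only to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3


# Assumed model size: 7 billion parameters (illustrative).
for bits in (16, 8, 4):
    print(f"7B model at {bits}-bit: {weight_memory_gib(7e9, bits):.1f} GiB")
```

Under these assumptions, the weights alone fit in 8GB only when stored at 8 bits per parameter or fewer, so any real training run would need to be measured rather than estimated this way.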

yuiseki commented 11 months ago

@kshitijrajsharma Wow! I am amazed at your quickness! Sorry for my delay in responding to you.

I was thinking of having 👍 and 👎 buttons on the frontend user interface of TRIDENT, like the ones on ChatGPT, to actively receive feedback from users. This should allow the model to be continuously improved by building and updating datasets that pair the natural language sentences entered by users with the Overpass API queries generated by the model.

However, TRIDENT does not yet have a database to store data permanently, so this would be a major change and a time-consuming development process.
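A minimal sketch of how such feedback could be stored, assuming SQLite as the database. The table name, column names, and helper functions are hypothetical and are not part of TRIDENT.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the feedback database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS feedback (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               user_input TEXT NOT NULL,
               overpass_query TEXT NOT NULL,
               rating INTEGER NOT NULL CHECK (rating IN (-1, 1))
           )"""
    )
    return conn


def record_feedback(conn: sqlite3.Connection, user_input: str,
                    overpass_query: str, thumbs_up: bool) -> None:
    """Store one 👍 (rating 1) or 👎 (rating -1) for an input/query pair."""
    conn.execute(
        "INSERT INTO feedback (user_input, overpass_query, rating) VALUES (?, ?, ?)",
        (user_input, overpass_query, 1 if thumbs_up else -1),
    )
    conn.commit()


def liked_pairs(conn: sqlite3.Connection) -> list:
    """Return (user_input, overpass_query) pairs that received a 👍."""
    rows = conn.execute(
        "SELECT user_input, overpass_query FROM feedback WHERE rating = 1 ORDER BY id"
    )
    return rows.fetchall()
```

The pairs returned by `liked_pairs` are the candidates for a training dataset, while the 👎 rows could serve as negative examples.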

If you find a dataset that pairs natural language sentences with Overpass API queries, please let me know. Narrowing it down from the Stack Overflow question-and-answer dataset seems like a very realistic idea.

- https://stackoverflow.com/questions/tagged/overpass-api
- https://stackoverflow.com/questions/tagged/overpass-api?tab=Votes
- https://stackoverflow.com/search?tab=votes&q=overpass&searchOn=3

My PC has a GPU with 12GB of VRAM, so I would be able to run the training. I also have a Google Colaboratory Pro subscription.

kshitijrajsharma commented 11 months ago

Does Stack Overflow provide an API to collect questions and answers? I need to check whether the license allows us to pull data from it. I heard Stack Overflow is also coming up with its own AI, so I assume we might be able to get data to train on. If not, we can store the results and queries from the tool itself. We can design a small Python service with a database table, attach it to the app, and include like and dislike buttons. This seems like a good idea: it will help us collect queries and questions tagged with likes, which can then be used to retrain other LMs.
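On the API question: the Stack Exchange API exposes a `/questions` endpoint that can be filtered by tag. Below is a minimal sketch that only builds the request URL, assuming API version 2.3 and the built-in `withbody` filter. It does not send the request, and the function name and defaults are hypothetical. The license terms for reusing the content would still need to be checked separately.

```python
from urllib.parse import urlencode

# Assumed API root and version.
API_ROOT = "https://api.stackexchange.com/2.3"


def questions_url(tag: str, page: int = 1, pagesize: int = 50) -> str:
    """Build a URL listing Stack Overflow questions for a tag, highest votes first."""
    params = {
        "site": "stackoverflow",
        "tagged": tag,
        "sort": "votes",
        "order": "desc",
        "page": page,
        "pagesize": pagesize,
        "filter": "withbody",  # assumed built-in filter that includes question bodies
    }
    return f"{API_ROOT}/questions?{urlencode(params)}"


url = questions_url("overpass-api")
```

Fetching the resulting URL page by page would give the raw questions, which could then be narrowed down to those that contain an actual Overpass query.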

yuiseki commented 2 months ago

@kshitijrajsharma Forgive me for contacting you so suddenly. This issue is my only connection to you. I feel I need your advice now.

I now have the most powerful computing resource of my life.
Here are the details:
- It is the prize for the gold winner of the Local AI Hackathon in Japan.
- Exclusive use rights until the end of April.
- 32-core, 64-thread CPU, 128 GB RAM.
- 8 GPUs, total 192GB VRAM.
- However, all results calculated with this computing resource must be published as OSS.
I am now overwhelmed by this computing resource.
I feel I have already done enough of what I need to do.
- https://huggingface.co/yuiseki
Please let me know if there is anything else I should accomplish with this computing resource.

https://github.com/UNopenGIS/7/issues/443

kshitijrajsharma commented 2 months ago

@yuiseki Sorry, I was away last week. That's awesome! While you have the resources, maybe you could try running some of the LLaMA models with sample spatial queries to start with? Have you found any updates regarding this issue? How did they perform? Any closer?

If you need training data, this is probably something we can generate.

yuiseki / TRIDENT

Consider supporting FOSS LLMs #58