
📦 Repomix (formerly Repopack) is a powerful tool that packs your entire repository into a single, AI-friendly file. Perfect for when you need to feed your codebase to Large Language Models (LLMs) or other AI tools like Claude, ChatGPT, and Gemini.

Various compression levels to reduce token count #36

Open joshellington opened 3 months ago

joshellington commented 3 months ago

First of all, love this project idea and have been using it very successfully with the new "Projects" feature in Claude.

The general idea for this enhancement would be flags to reduce token count/cost through minification or even obfuscation. It's quite easy to manipulate artifacts in Claude etc., so lower token counts would let folks push the limits more easily. Going to work on this tomorrow/this week, but curious if anyone has considered/tried it yet.

One option I'm considering: a built-in comparison method that runs popular "compression" algorithms against tokenizer measurements, while staying language-agnostic.
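The comparison idea could look something like this sketch (all names here are hypothetical, not Repopack API; a rough chars/4 estimate stands in for a real tokenizer, which a real version would replace with something like tiktoken):

```javascript
// Hypothetical harness: run each candidate "compression" strategy over the
// same source and rank them by estimated token count.
const approxTokens = (text) => Math.ceil(text.length / 4); // crude stand-in

const strategies = {
  identity: (src) => src,
  stripEmptyLines: (src) =>
    src.split('\n').filter((line) => line.trim() !== '').join('\n'),
  collapseWhitespace: (src) => src.replace(/[ \t]+/g, ' '),
};

// Returns strategies sorted from fewest to most estimated tokens.
function compareStrategies(source) {
  return Object.entries(strategies)
    .map(([name, fn]) => ({ name, tokens: approxTokens(fn(source)) }))
    .sort((a, b) => a.tokens - b.tokens);
}
```

The point would be that the harness stays language-agnostic: strategies only see text, and the tokenizer is the single source of truth for "better".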

yamadashy commented 3 months ago

Thank you for this excellent suggestion! I'm glad Repopack is working well with Claude's "Projects" feature.

Your idea to reduce token count is spot-on. Here's how we could approach it:

  1. Implement basic, language-agnostic optimizations
  2. Add library-based minification for common languages (JS, HTML, CSS) later.

Could you share more about your specific vision for these optimizations? I'm particularly interested in what you consider "basic, language-agnostic optimizations".

For configuration, I'm considering something like this in repopack.config.json:

{
  "output": {
    "optimize": {
      "minify": true
    }
  }
}

The output section currently has options like removeComments and removeEmptyLines. We could either move these into the optimize object or, if we don't anticipate adding many more optimize options, simplify it to output.minify. What are your thoughts on this structure?
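For the backwards-compatibility question, one hedged sketch of resolving both config shapes (this is not the real Repopack schema, just an illustration of letting the nested optimize object take precedence over the existing flat options):

```javascript
// Hypothetical config normalization: accept both the current flat options
// (output.removeComments / output.removeEmptyLines) and the proposed
// output.optimize object, preferring the nested form when both are present.
function resolveOutputOptions(output = {}) {
  const optimize = output.optimize ?? {};
  return {
    removeComments: optimize.removeComments ?? output.removeComments ?? false,
    removeEmptyLines: optimize.removeEmptyLines ?? output.removeEmptyLines ?? false,
    minify: optimize.minify ?? false,
  };
}
```

That way existing configs keep working while new ones can group everything under optimize.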

Also, you mentioned obfuscation in your original suggestion. I'm not quite sure how this would apply in our context. Could you elaborate on what you mean by obfuscation and how you envision it being used with Repopack?

Your insights are invaluable as we shape this feature. Looking forward to your thoughts!

yamadashy commented 3 months ago

By the way, the current codebase was hastily designed and is not very adaptable to change. I'm planning to gradually refactor it into a more flexible architecture, which might create some conflicts as you implement new features.

joshellington commented 3 months ago

@yamadashy I like the minify: true option ergonomics.

And I was overly optimistic in assuming there existed a library that could minify/compress any number of source code file types and languages 🤣. Still haven't found anything yet. Also, scratch obfuscation - not sure where I was coming from on that.

IMO, we would want (need?) to support any language at a high level, versus trying to whack-a-mole individual languages (see the latest SO survey).

I'm not a compression expert by any means, but I wonder if a generic compression method like Brotli, gzip, etc. could be "natively" decompressed by the popular LLMs (with instruction/guidance from the prompt/header section)? I'll poke around a bit on that.

joshellington commented 3 months ago

Ok, scratch gzip/Brotli compression (I did some quick tests and the model just hallucinates in most places). But it looks like there are some ongoing papers/projects working on prompt compression, this one as an example: https://llmlingua.com/

Going to dig a bit more.

yamadashy commented 3 months ago

Thank you for your verification!

My initial idea was to support only major languages, but your approach of handling it comprehensively is much smarter! I'm not familiar with prompt compression, so I'll study it.

joshellington commented 3 months ago

So I dug in a bit more - we would likely need to create a JS port to keep everything aligned within the toolchain. TL;DR: run a local Llama 3.1 model to generate a compressed (and measurably improved) version of every file. I'm thinking it may be out of scope for this project, tbh. But I'm going to keep digging, see if we can replicate the prompt compression library first, and support low- to mid-end consumer hardware.

yamadashy commented 3 months ago

Using Llama is quite an ambitious approach! I hadn't thought of that idea.

Since this is being prepared as an opt-in feature, I think an ambitious approach like that is fine for the users who choose it.

I'm sorry I don't have much information about prompt compression myself, but it would be helpful if you could continue the investigation and share your results.

unformatt commented 3 months ago

An "easy" first approach could be to introduce hooks into repopack that allow arbitrary pre-processing of individual files. Instead of putting the burden on repopack of 1) implementing multiple minification tools and 2) exposing all of those tools' config options in the repopack config, repopack could pipe the contents of each file into an arbitrary script or tool, with the repopack user responsible for implementing the compression via third-party tools or their own scripts.

e.g. npx repopack --pre-processor=./myscript.sh

bnssoftware commented 3 months ago

I too have been using this with great success with the projects feature in Claude. Thank you so much for making this amazing tool! I look forward to future enhancements.