yamadashy / repomix

📦 Repomix (formerly Repopack) is a powerful tool that packs your entire repository into a single, AI-friendly file. Perfect for when you need to feed your codebase to Large Language Models (LLMs) or other AI tools like Claude, ChatGPT, and Gemini.

Add file truncation #149

Open klntsky opened 1 week ago

klntsky commented 1 week ago

The use case is: I have multiple JSON data files. I want to include them in the LLM input, but only to show their structure, not the contents. I'd like to be able to specify that I just want to include the first N lines.

yamadashy commented 1 week ago

Hi @klntsky!

I'm thinking of implementing this with a new process config option. Does this kind of structure match what you had in mind?

repomix.config.json

{
  "output": {
    // ... output config
  },
  "process": {
    "maxLines": 100,             // Default limit for all files
    "patterns": [
      {
        "pattern": "**/*.json",  // Special limits for JSON files
        "maxLines": 20
      }
    ]
  }
}

The output would look like:

{
  "users": [
    {
      "id": 1,
      "name": "John"
    }
  ]
... (truncated)
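
For illustration, here's a minimal sketch of the line-based truncation in TypeScript (the helper name and the truncation marker are only illustrative, not existing repomix code):

// Illustrative only: keep the first `maxLines` lines of a file and mark the cut.
function truncateByLines(content: string, maxLines: number): string {
  const lines = content.split('\n');
  if (lines.length <= maxLines) {
    return content;
  }
  return lines.slice(0, maxLines).join('\n') + '\n... (truncated)';
}

// e.g. truncateByLines(jsonFileContent, 20) keeps the first 20 lines
// and appends the "... (truncated)" marker shown above.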

Let me know if this is heading in the right direction!

klntsky commented 1 week ago

In some cases it might be useful to limit characters or words rather than lines (e.g. unformatted JSON on a single line). Maybe all three should be configurable?

yamadashy commented 1 week ago

@klntsky If I'm understanding your intention correctly, I think the underlying issue here is that including entire file contents can consume a large number of tokens, which is a common problem for projects using repomix with LLMs.

Given this context and considering how LLMs process text, I think focusing on token count would be the most appropriate approach initially. Something like:

{
  "process": {
    "maxTokens": 1000,          // Global token limit
    "patterns": [
      {
        "pattern": "**/*.json",  
        "maxTokens": 500        // Pattern-specific token limit
      }
    ]
  }
}
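
As a rough sketch, the token-based truncation could look something like this (js-tiktoken and the cl100k_base encoding are used here purely as examples; the actual tokenizer and helper name aren't decided):

// Illustrative only: keep roughly the first `maxTokens` tokens of a file.
import { getEncoding } from 'js-tiktoken';

const enc = getEncoding('cl100k_base');

function truncateByTokens(content: string, maxTokens: number): string {
  const tokens = enc.encode(content);
  if (tokens.length <= maxTokens) {
    return content;
  }
  // Decode only the first maxTokens tokens and mark the cut.
  return enc.decode(tokens.slice(0, maxTokens)) + '\n... (truncated)';
}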

I'd like to start with this simpler requirement to minimize potential bugs.

What do you think about this approach?

klntsky commented 1 week ago

Yep, token limits seem to cover both cases, but I'd still like line limits too: it's not immediately obvious how many tokens a given part of a file contains, whereas line counts can be checked visually.

yamadashy commented 1 week ago

That makes sense. We could support both maxLines and maxTokens, truncating when either limit is reached.

Let me think about this a bit more.
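
For illustration, combining the two could look roughly like the sketch below (again using js-tiktoken only as an example tokenizer; all names here are hypothetical):

// Illustrative only: apply both limits and keep whichever cut is shorter,
// so the file is truncated as soon as either limit is exceeded.
import { getEncoding } from 'js-tiktoken';

const enc = getEncoding('cl100k_base');

function truncate(content: string, maxLines?: number, maxTokens?: number): string {
  const lines = content.split('\n');
  const tokens = enc.encode(content);

  const byLines =
    maxLines !== undefined && lines.length > maxLines
      ? lines.slice(0, maxLines).join('\n')
      : content;
  const byTokens =
    maxTokens !== undefined && tokens.length > maxTokens
      ? enc.decode(tokens.slice(0, maxTokens))
      : content;

  if (byLines === content && byTokens === content) {
    return content; // neither limit exceeded
  }
  // Keep whichever truncation is shorter, i.e. the limit that was hit first.
  const kept = byLines.length <= byTokens.length ? byLines : byTokens;
  return kept + '\n... (truncated)';
}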