yamadashy / repomix

📦 Repomix (formerly Repopack) is a powerful tool that packs your entire repository into a single, AI-friendly file. Perfect for when you need to feed your codebase to Large Language Models (LLMs) or other AI tools like Claude, ChatGPT, and Gemini.
MIT License

Remove images from notebooks #163

Open IgnacioHeredia opened 2 weeks ago

IgnacioHeredia commented 2 weeks ago

Right now, the file generated by repomix includes the images embedded in the outputs of .ipynb notebooks.

  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Max Training: 516.0\n",
      "Min Training: 1.0\n",
      "Mean Training: 58.099391480730226\n",
      "Median Training: 39.0\n",
      "\n",
      "\n",
      "Max Validation: 23.0\n",
      "Min Validation: 1.0\n",
      "Mean Validation: 1.8539553752535496\n",
      "Median Validation: 1.0\n",
      "\n",
      "\n"
     ]
    },
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAABIcAAAE/CAYAAADc0KMkAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADl0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uIDI ...

I think there should be some preprocessing to remove them (or replace them with a placeholder), since they don't add any value and take up a lot of space, which makes it harder to retrieve the truly relevant information.

My quick-and-dirty workaround for now is to remove those lines from the generated file:

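input_file = 'repomix.txt'
output_file = 'repomix_modified.txt'

# Stream the file line by line and skip any line holding a base64-encoded PNG payload.
with open(input_file, 'r', encoding='utf-8') as src, \
        open(output_file, 'w', encoding='utf-8') as dst:
    for line in src:
        # lstrip() so the match does not depend on the exact JSON indentation.
        if not line.lstrip().startswith('"image/png": "'):
            dst.write(line)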

The generated repomix file went from 5MB to 300kB.
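
The line filter above only catches PNG payloads that sit on a single line. A more structured variant of the same idea, as a rough sketch only, would parse each notebook as JSON before packing and replace every `image/*` entry in a cell's output `data` bundle with a placeholder. This assumes the nbformat 4 layout (top-level `cells`, each with an optional `outputs` list). The function name `strip_notebook_images` and its `placeholder` parameter are hypothetical and are not part of repomix.

```python
import json
from typing import Optional


def strip_notebook_images(notebook_json: str,
                          placeholder: Optional[str] = "<image removed>") -> str:
    """Return notebook JSON with image/* output payloads replaced or dropped.

    Hypothetical helper, not a repomix API. Assumes nbformat 4 structure.
    If placeholder is None, the image entries are deleted entirely.
    """
    nb = json.loads(notebook_json)
    for cell in nb.get("cells", []):
        # Markdown cells have no "outputs" key, so default to an empty list.
        for output in cell.get("outputs", []):
            data = output.get("data")
            if not isinstance(data, dict):
                continue  # e.g. stream outputs carry "text", not a "data" bundle
            # Copy the keys first because the dict may be mutated in the loop.
            for mime in list(data):
                if mime.startswith("image/"):
                    if placeholder is None:
                        del data[mime]
                    else:
                        data[mime] = placeholder
    return json.dumps(nb, indent=1)
```

Passing `placeholder=None` drops the images completely, while the default keeps a short marker so a reader can still tell that a cell produced a figure.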

yamadashy commented 2 weeks ago

Hi, @IgnacioHeredia. Thanks for reporting this issue!

The image size problem in .ipynb files is a valid concern.

I'm thinking about how to best implement this as a configurable feature. I'll work on implementing this improvement. If you have any thoughts on the configuration approach, feel free to share them!