zix99 / rare

Create console histograms, bar graphs, tables, heatmaps and more in realtime using regex and expressions.
https://rare.zdyn.net/
GNU General Public License v3.0
265 stars 15 forks

Can't change polling speed/batch options when piping output to rare #61

Closed xpire closed 2 years ago

xpire commented 2 years ago

Hey, I just discovered this tool and really like it, but I found a bug when trying to follow a large amount of standard output. Say I have the following Python file, which just generates a stream of numbers:

import random

BOXES = 100
BALLS = 100

def sim():
    boxes = [0 for a in range(BOXES)]
    for ball in range(BALLS):
        boxes[random.randint(0, BOXES-1)] += 1
    return len([b for b in boxes if b == 0])

for i in range(10000000):
    print(sim())

To use rare to show the distribution of the output live while the simulation runs, I would run

python3 ./random_100_box_ball.py | rare histo -f -x --sortkey -n 30 --batch 1

However, the updates through the pipe come once a second, even after setting the batch size to something small like 1.

https://user-images.githubusercontent.com/30570611/164960639-ef2ee840-51c3-4e7b-b7fa-a64accbd4c06.mp4

If I instead pipe the output to a file and then follow the file with rare, I get the live updates I was after:

python3 random_100_box_ball.py > test.txt
rare histo -f --sortkey -n 30 --batch 1 ./test.txt

https://user-images.githubusercontent.com/30570611/164960656-a841fbe3-6314-4fd2-90ad-cba9e71aa452.mp4

When not reading from a pipe, the batch settings are respected and the update rate can be tuned. I was wondering if you knew why this is the case, and whether pipes could be made to emulate the behaviour seen when following a file?

N.B. I'm running Ubuntu 20.04 under WSL2 on Windows 10, using the prebuilt binaries (rare and rare-pcre both suffer from the same issue):

04:55:37 ~  -> rare --version
rare version 0.2.1, 29f1bd5; regex: re2
04:55:39 ~  -> rare-pcre --version
rare-pcre version 0.2.1, 29f1bd5; regex: libpcre2
zix99 commented 2 years ago

Thanks for the report.

I did some testing with your script and can definitely reproduce this. After a little debugging, it looks like it's partially due to the read-ahead buffer (a piece of code that optimizes reading from disk) being fixed at 128 KB. This means rare won't surface new data until it has received at least 128 KB, which takes your Python script a few seconds to produce.
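To illustrate the effect (a minimal sketch with hypothetical names, not rare's actual code): a reader that insists on filling a fixed read-ahead buffer before handing lines on will sit on a slow producer's output until the buffer fills or the stream ends.

```python
import io

READ_AHEAD = 128 * 1024  # fixed read-ahead size, per the debugging above


def buffered_lines(stream, read_ahead=READ_AHEAD):
    """Yield lines from a binary stream, but only once `read_ahead` bytes
    have been read (or the stream hits EOF) -- mimicking a fixed
    read-ahead buffer. A producer emitting ~15 KB/s would stall here
    for several seconds before the first line appears."""
    pending = b""
    while True:
        # .read(n) on a buffered stream blocks until n bytes arrive or EOF
        chunk = stream.read(read_ahead)
        if not chunk:
            break
        pending += chunk
        while b"\n" in pending:
            line, pending = pending.split(b"\n", 1)
            yield line
    if pending:  # trailing partial line at EOF
        yield pending
```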

My current thought is that this is sub-optimal when reading from stdin: the data is already off-disk, so there's little performance to gain from read-ahead. I'll do a little more experimenting to see if I can get something that works well and maintains high performance.
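For contrast, a pipe read that doesn't wait for a full buffer hands data through at the producer's pace. Sketched in Python for illustration (not rare's actual Go code): os.read on a pipe returns as soon as any bytes are available.

```python
import os


def available_chunks(fd, max_chunk=64 * 1024):
    """Yield whatever bytes are currently available on fd (e.g. a pipe).

    os.read returns as soon as *any* data is ready; it never waits for a
    full max_chunk, so a slow producer's lines flow through immediately."""
    while True:
        chunk = os.read(fd, max_chunk)
        if not chunk:  # empty read means the write end was closed
            return
        yield chunk
```

In a pipeline like the one in the report, reading this way would surface each burst of output as soon as the producing script flushes it.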

edit: Also, I ran the Python script through pv to measure how quickly it generates data (python3 ./test.py | pv > /dev/null), and it produces ~15 KB/s in bursts, which makes sense given that the script builds up a chunk of output before printing. When I drop the read-ahead buffer size, I see similar throughput in rare. You'll still see some jumpiness because of how the Python script emits its output, but I think fixing the read-ahead will make it much better.
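As an aside on the burstiness: CPython block-buffers stdout when it is piped (it is line-buffered only on a TTY), so the producer itself releases output in multi-kilobyte chunks. A variant of the simulation that flushes every line (equivalently, running it under python3 -u) smooths the producer side; this doesn't change rare's read-ahead behaviour, only how promptly the data reaches the pipe.

```python
import random

BOXES = 100
BALLS = 100


def sim():
    """One trial: drop BALLS balls into BOXES boxes, count the empty boxes."""
    boxes = [0] * BOXES
    for _ in range(BALLS):
        boxes[random.randint(0, BOXES - 1)] += 1
    return sum(1 for b in boxes if b == 0)


def run(iterations):
    for _ in range(iterations):
        # flush=True pushes each line into the pipe immediately instead of
        # letting it sit in stdout's block buffer (the default when piped)
        print(sim(), flush=True)


run(5)  # the original script runs 10000000 iterations
```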