whitfin / runiq

An efficient way to filter duplicate lines from input, à la uniq.
MIT License

Add a -c (count) flag? #7

Closed pfmoore closed 3 years ago

pfmoore commented 4 years ago

Would it be possible to add a -c flag to output a count of each unique line, like uniq has? A significant proportion of my usage of uniq is in the form of sort | uniq -c | sort -n, and being able to use runiq to replace that initial pair of commands would be really nice.

whitfin commented 3 years ago

Hi!

Sorry, haven't had time to look at this in a while. In this case though, there's really no need to add this. You can do the following:

cat file.txt | runiq | wc -l

pfmoore commented 3 years ago

Sorry for the delay, but your suggestion displays the number of unique values, whereas my proposal displays how many times each unique value occurs.

>cat file.txt
a
b
c
a
c
c
>cat file.txt | runiq | wc -l
3
>cat file.txt | runiq -c
      2 a
      1 b
      3 c

So there really isn't a way to get the -c functionality my PR provides with existing commands 🙁
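For illustration, the per-line counting requested here can be sketched in a few lines of Python. This is not runiq's code or the PR's implementation; `count_lines` and `format_counts` are hypothetical names, and the 7-character count column is assumed from the sample output above.

```python
from collections import Counter
from typing import Iterable, List


def count_lines(lines: Iterable[str]) -> Counter:
    """Count occurrences of each line, keeping first-seen order."""
    counts: Counter = Counter()
    for line in lines:
        # Strip the trailing newline so "a\n" and "a" count as the same line.
        counts[line.rstrip("\n")] += 1
    return counts


def format_counts(counts: Counter) -> List[str]:
    """Render counts right-aligned in a 7-character column (assumed format)."""
    return [f"{count:7d} {line}" for line, count in counts.items()]


if __name__ == "__main__":
    sample = ["a", "b", "c", "a", "c", "c"]
    for row in format_counts(count_lines(sample)):
        print(row)
```

Note that every distinct line has to stay in the `Counter` until the input ends, since a count is only final once nothing more can arrive.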

pfmoore commented 3 years ago

@whitfin Given that your suggested approach doesn't do what I want, could you comment again on the request?

whitfin commented 3 years ago

@pfmoore ah, misread.

What you want isn't really viable, because it requires that all values are stored against counts in memory - while this might be nice for very small inputs, it will explode for large inputs (which defeats the point of why runiq exists).
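A rough Python sketch of the trade-off described here, assuming that a deduplicating filter only needs a small fixed-size digest per distinct line (how runiq's own filters store their state is not shown in this thread). `dedupe_stream` is a hypothetical name.

```python
import hashlib
from typing import Iterable, Iterator


def dedupe_stream(lines: Iterable[str]) -> Iterator[str]:
    """Yield each line the first time it is seen.

    Only an 8-byte digest per distinct line is retained (an assumption made
    for this sketch), so memory does not grow with line length, and output
    can start before the input ends. A digest collision could drop a line.
    """
    seen = set()
    for line in lines:
        digest = hashlib.blake2b(line.encode("utf-8"), digest_size=8).digest()
        if digest not in seen:
            seen.add(digest)
            yield line


if __name__ == "__main__":
    for line in dedupe_stream(["a", "b", "c", "a", "c", "c"]):
        print(line)
```

Counting cannot be streamed this way: the text of each distinct line has to be kept alongside its count so that it can be printed once the input is exhausted.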

I can think about whether there's another way, but on the face of it, it's not going to be possible.

pfmoore commented 3 years ago

That's a fair point. If we want to write counts, we do have to keep every line we'll be writing out until the end: until we've read all of the input, we can't know that we won't see another copy of a line we've already stored, so we can't start writing anything before then.

My PR keeps that list of output in a separate data structure, mainly because I didn't know enough about the data structures you were using in the filter module to try including the count information there. It's enough for the size of data I typically use uniq -c on, and I personally feel that it would be enough to note that the -c option needs enough memory to hold all of the output lines, and let the program fail if the user doesn't heed that warning. But I'm OK if you prefer to take the view that keeping memory usage bounded is more important.

My main use case is as a replacement for sort FILE | uniq -c, using a native Windows build, rather than ports of Unix utilities (which typically don't handle Unicode properly on Windows), so I suspect I'm not really in the main target audience for this program. So if my use case doesn't fit the main focus of the code, that's fine.

Thanks for reconsidering my request anyway.