whitfin / runiq

An efficient way to filter duplicate lines from input, à la uniq.

Feature Request: More unique uniqueness flag #13

Open StaticPH opened 3 years ago

StaticPH commented 3 years ago

As it stands, both runiq and runiq --invert always include a single instance of each value that occurs more than once in the input. There is, however, no option to completely omit values that occur more than once. I would like to see some sort of '--no-duped' flag (the name is open for debate), probably mutually exclusive with --invert, that filters out every occurrence of duplicated data, rather than leaving a single instance as the current behavior does. Example:

$ cat fileA
a1
b7
c1
d3
$ cat fileB
a7
b3
d8
c1
d3

With the current behavior, runiq fileA fileB would produce:

a1
b7
c1
d3
a7
b3
d8

runiq --no-duped fileA fileB would then produce:

a1
b7
a7
b3
d8
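
(For comparison, plain coreutils can already produce this result when sorted output is acceptable, since uniq -u prints only lines that appear exactly once; the trade-off is that the original input order is lost:)

$ # standard coreutils; note the output is sorted, not in input order
$ sort fileA fileB | uniq -u
a1
a7
b3
b7
d8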
whitfin commented 3 years ago

Hi @StaticPH!

Although this sounds reasonable, it's likely not viable because it requires storing every value in memory to emit at the end (seeing as you need to process all input to know whether there has been a duplicate or not).

This is probably not something that should be added, since it's far too easy for unsuspecting users to blow up on memory. If there's some magic that might allow this to work more efficiently I'm all ears, but on the face of it, it just doesn't seem plausible.
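
For illustration, an order-preserving version of the requested filter (sketched here in awk, not runiq's actual code) has to hold every input line until end of input before it can print anything, which is exactly the memory cost described above:

$ # sketch only: buffers all lines, then prints those with a count of 1
$ awk '{ lines[NR] = $0; count[$0]++ } END { for (i = 1; i <= NR; i++) if (count[lines[i]] == 1) print lines[i] }' fileA fileB
a1
b7
a7
b3
d8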

StaticPH commented 3 years ago

I'd be fine with just running runiq twice for such tasks: once with --invert to find all the values with duplicates, and a second time with some flag indicating that values matching subsequent arguments (or even just all values in some FILE; hopefully the code would play nicely with command redirection) should be ignored entirely. Ideally the user would only have to enter the command a single time with a specific new flag, and the internal code would handle the multiple passes, storing only the output from --invert internally. I assume that keeping access to the beginning of a data stream is possible with some form of peek operation, even if it induces some extra IO buffering.

If both of those particular methods still have the issue of memory blow-up, which I suspect they would, it wouldn't be a deal-breaker for the feature to work only with "permanent" files (no command redirection or pipelines).

I could probably get a similar effect by piping the output from the second run through some combination of grep, sed, and awk commands, but I don't think any of those easily supports a variable number of fixed-string patterns in an automated fashion. It'd be best for usability to have this all happen in one tool in a single run, but two runs of that one tool are acceptable, considering the good point you raise.
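
For what it's worth, a two-step sketch along these lines is possible with grep alone, assuming --invert emits the duplicated values as described above: grep's -F (fixed strings), -x (whole-line match), and -f (patterns from a file) flags together handle an arbitrary number of fixed-string patterns:

$ # assumes `runiq --invert` prints the values that occur more than once
$ runiq --invert fileA fileB | sort -u > dupes.txt
$ runiq fileA fileB | grep -vxFf dupes.txt
a1
b7
a7
b3
d8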