zendesk / ultragrep

the grep that greps the hardest.

SQLite #39

Closed · osheroff closed this 2 years ago

osheroff commented 10 years ago

I dunno if this is WIP or what. It's clearly a cleaner approach, but it doesn't in itself provide any new functionality at all, and merging means rebuilding all the ultragrep indexes out there. So I dunno, maybe this is RFC? I'm also not sure that this will be the final table format.

@vanchi-zendesk

vanchi-zendesk commented 10 years ago

I think it is better to further split ug_cat down into gz_random_access and a simple cat. This gz_random_access tool can have its own index which simply marks raw_offset, gz_offset, and gz_header, and that can be immensely useful even from a non-ultragrep point of view. It could be a separate tool too.
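A minimal sketch of what such a checkpoint index could look like, assuming the writer compresses with zlib and emits a Z_FULL_FLUSH at each checkpoint so a reader can seek straight to gz_offset and resume decompression there. The record layout and names are hypothetical, not anything ultragrep ships today:

```c
/* Hypothetical checkpoint index for random access into a gzip file.
 * Z_FULL_FLUSH discards the deflate history at each checkpoint, so a
 * reader can fseek() to gz_offset and restart inflate there with no
 * preceding 32 KB window. A real index would presumably also record
 * gz_header, as suggested above. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <zlib.h>

#define CHUNK 16384

struct checkpoint {
    uint64_t raw_offset; /* byte offset in the uncompressed log */
    uint64_t gz_offset;  /* byte offset in the compressed file  */
};

/* Compress `in` to `out`, appending a checkpoint record to `idx`
 * roughly every `interval` uncompressed bytes. */
static int gz_write_indexed(FILE *in, FILE *out, FILE *idx, uint64_t interval)
{
    unsigned char ibuf[CHUNK], obuf[CHUNK];
    z_stream s;
    memset(&s, 0, sizeof(s));
    /* windowBits 15+16 asks zlib for a gzip (not zlib) wrapper. */
    if (deflateInit2(&s, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
                     15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return Z_ERRNO;

    uint64_t raw = 0, next_ckpt = interval;
    int flush;
    do {
        s.avail_in = (uInt)fread(ibuf, 1, CHUNK, in);
        s.next_in = ibuf;
        raw += s.avail_in;
        flush = feof(in) ? Z_FINISH
              : raw >= next_ckpt ? Z_FULL_FLUSH : Z_NO_FLUSH;
        do { /* drain all compressed output for this chunk */
            s.avail_out = CHUNK;
            s.next_out = obuf;
            deflate(&s, flush);
            fwrite(obuf, 1, CHUNK - s.avail_out, out);
        } while (s.avail_out == 0);

        if (flush == Z_FULL_FLUSH) {
            struct checkpoint c = { raw, s.total_out };
            fwrite(&c, sizeof c, 1, idx);
            next_ckpt += interval;
        }
    } while (flush != Z_FINISH);

    deflateEnd(&s);
    return Z_OK;
}
```

To serve a read at raw offset R, binary-search the checkpoints for the last raw_offset <= R, seek to its gz_offset, inflate forward from there (raw deflate, windowBits -15), and discard R - raw_offset bytes of output.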

With that, maybe each request can have: start offset, end offset, ts, and other keyword indices. That is completely independent of the gzipping, and ultragrep itself can add indices based on usage. Maybe we can also ask ultragrep to make indices on demand before our analysis begins:

ultragrep -make-index account /regexp with subgroups/ <group number>
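Since this PR is about a SQLite-backed index, here is one guess at how "start offset, end offset, ts" could look as a schema. The table and column names are invented for illustration, and per osheroff's caveat above this is surely not the final table format:

```c
/* Hypothetical per-request index schema -- a guess at "start offset,
 * end offset, ts", not the schema this PR actually ships. */
#include <stdio.h>
#include <sqlite3.h>

int main(void)
{
    sqlite3 *db;
    char *err = NULL;
    if (sqlite3_open("ultragrep.idx", &db) != SQLITE_OK) {
        sqlite3_close(db);
        return 1;
    }

    const char *schema =
        "CREATE TABLE IF NOT EXISTS requests ("
        "  start_offset INTEGER NOT NULL,"   /* raw offset of first line */
        "  end_offset   INTEGER NOT NULL,"   /* raw offset past last line */
        "  ts           INTEGER NOT NULL"    /* request timestamp */
        ");"
        /* Time-range lookups drive ultragrep, so index on ts. */
        "CREATE INDEX IF NOT EXISTS requests_ts ON requests (ts);"
        /* An on-demand keyword index (-make-index) could land in a
         * side table keyed by the extracted regexp subgroup. */
        "CREATE TABLE IF NOT EXISTS kw_account ("
        "  value TEXT NOT NULL,"
        "  start_offset INTEGER NOT NULL"
        ");";

    if (sqlite3_exec(db, schema, NULL, NULL, &err) != SQLITE_OK) {
        fprintf(stderr, "schema: %s\n", err);
        sqlite3_free(err);
    }
    sqlite3_close(db);
    return 0;
}
```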
vanchi-zendesk commented 10 years ago

My point is, when we built the index we already figured out the start, the end, and the timestamp of each request. Why should ug_cat ever do that again?

osheroff commented 10 years ago

yeah, I've been thinking about it. The problem is that we only build indexes on a cron job, so we can't be sure we've indexed all the way to the end of the file. So even ug_cat will at some point need to shrug its shoulders and ask ug_guts (or similar) "where's the request boundary?"
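The fallback could be mechanical: serve from the index while it covers the target, then degrade to boundary scanning past the index's high-water mark. A rough sketch of that decision; all three helpers are assumptions standing in for real lookups, not actual ultragrep APIs:

```c
/* Hypothetical fallback: the cron-built index may stop short of the
 * end of the log, so anything past its high-water mark has to be
 * found by re-detecting request boundaries, ug_guts-style. */
#include <stdint.h>
#include <stdbool.h>

struct req_span { uint64_t start, end; };

/* Provided elsewhere (assumed, for illustration):
 *   index_max_offset()  - last raw offset the index covers
 *   index_lookup()      - find a request span via the sqlite index
 *   scan_for_boundary() - scan forward from `from`, matching the
 *                         request-start pattern to rebuild a span   */
uint64_t index_max_offset(void);
bool index_lookup(uint64_t target, struct req_span *out);
bool scan_for_boundary(uint64_t from, uint64_t target, struct req_span *out);

bool find_request(uint64_t target, struct req_span *out)
{
    if (target <= index_max_offset() && index_lookup(target, out))
        return true;
    /* Index is stale past its high-water mark: shrug and scan. */
    return scan_for_boundary(index_max_offset(), target, out);
}
```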

vanchi-zendesk commented 10 years ago

I can't find any problem with this code.

However, maybe you should merge this after a couple of days of thought. I can't think of a good way to fall back either. Maybe you will have better ideas for the table design as well.

+1 [ after 2 days of meditation :-) ]

vanchi-zendesk commented 10 years ago

Maybe ug_guts should work this way (a rough dispatch is sketched after this list):

- gz_cat (or whatever we call it)
- building indexes: ug_guts -detect-mode, for creating ts and other indexes
- ultragrep: ug_guts -print mode, with or without an index based on availability, or tailing
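Wired up as one binary, that split might look like a thin mode dispatch. The -detect-mode and -print flag names follow the list above; the -tail flag and the stub bodies are invented for illustration:

```c
/* Hypothetical ug_guts entry point for the proposed modes. */
#include <stdio.h>
#include <string.h>

/* Stubs standing in for the real work. */
static int detect_mode(const char *log) { printf("index %s\n", log); return 0; }
static int print_mode(const char *log)  { printf("print %s\n", log); return 0; }
static int tail_mode(const char *log)   { printf("tail %s\n", log);  return 0; }

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: ug_guts -detect-mode|-print|-tail <log>\n");
        return 2;
    }
    if (!strcmp(argv[1], "-detect-mode"))
        return detect_mode(argv[2]);   /* build ts + keyword indexes */
    if (!strcmp(argv[1], "-print"))
        return print_mode(argv[2]);    /* use index if present, else scan */
    if (!strcmp(argv[1], "-tail"))
        return tail_mode(argv[2]);     /* follow a growing log */
    fprintf(stderr, "ug_guts: unknown mode %s\n", argv[1]);
    return 2;
}
```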

There is also scope for another kind of parallelism here: since each request can simply be read from somewhere using ug_guts, we can open all the indexes, locate a fixed number of requests from them, fire up a fixed number of threads, and read all the requests for a small time window in parallel (pushing them through a queue to restore order). Each thread, I presume, will be operating from some common file cache and working independently at its own file position...
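One way that could shake out, sketched with pthreads: give each thread a strided subset of the located requests and join before emitting, which restores order without an explicit queue (a streaming queue variant would emit results as slots complete instead). The counts and read_request stub are illustrative only:

```c
/* Hypothetical parallel read: N workers each fetch a strided subset
 * of the requests located via the index; results land in one slot
 * per request, and the main thread prints the slots in order. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define NREQ 64      /* requests located for this time window */
#define NTHREADS 4

/* Stand-in for reading one request body via its indexed offsets. */
static char *read_request(int i)
{
    char *s = malloc(32);
    snprintf(s, 32, "request %d", i);
    return s;
}

static char *slots[NREQ];

static void *worker(void *arg)
{
    long tid = (long)arg;
    /* Thread tid owns requests tid, tid+NTHREADS, tid+2*NTHREADS, ...
     * so no two threads touch the same slot and no locking is needed. */
    for (int i = (int)tid; i < NREQ; i += NTHREADS)
        slots[i] = read_request(i);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    /* Emit in request order once all workers are done. */
    for (int i = 0; i < NREQ; i++) {
        puts(slots[i]);
        free(slots[i]);
    }
    return 0;
}
```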