zulip / python-zulip-api

Python library for the Zulip API.
https://zulip.com/api/
Apache License 2.0

Can I get a total count of messages given a narrow filter? #650

Closed: dwinston closed this issue 3 years ago

dwinston commented 3 years ago

I am aware of the found_newest value returned by GET /api/v1/messages, which I use in paginating requests to decide whether there are more messages to fetch. However, there are a lot of messages on the server, and I would like to be able to show myself a progress bar so as to estimate the time to complete my fetch of messages. Is there a (perhaps undocumented?) way to get such a count for a narrow filter?

timabbott commented 3 years ago

There isn't a way to do this, because the server doesn't fetch more messages from the database than it needs to answer your query (doing otherwise would make the query much slower).

One thing that you could do to hack this would be to fetch the very oldest message in the organization, and the very newest, and then do a progress bar based on what fraction of the ID space you've covered; it wouldn't be super accurate, but would result in a useful progress bar in most cases.
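A minimal sketch of that ID-space estimate, assuming you already know the oldest and newest message IDs and the ID of the last message fetched so far. The function name `id_space_progress` is hypothetical and not part of the Zulip bindings. Because message IDs are not evenly spread across a narrow, the result is only a rough fraction.

```python
def id_space_progress(oldest_id: int, newest_id: int, last_fetched_id: int) -> float:
    """Estimate fetch progress as the fraction of the ID space covered so far.

    Hypothetical helper: assumes messages are fetched in ascending ID order,
    from oldest_id towards newest_id.
    """
    span = newest_id - oldest_id
    if span <= 0:
        # Zero or one message in the range: nothing left to fetch.
        return 1.0
    fraction = (last_fetched_id - oldest_id) / span
    # Clamp, since IDs outside the expected range can appear
    # (for example, messages sent while the fetch is running).
    return min(1.0, max(0.0, fraction))
```

Multiplying the returned fraction by 100 gives a percentage for a progress bar, and dividing elapsed time by the fraction gives a crude estimate of total run time.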

OOC what's the use case you had in mind where you want such a progress bar?

dwinston commented 3 years ago

Thanks @timabbott -- that's a great idea with the ID space. My use case is expressed in this gist -- I wanted to get all public-stream messages so that I could use the data to have fun trying to apply PageRank to Zulip entities (e.g. users @-mentioning others).

My initial code was incorrect because I had misunderstood the meaning and usage of the "anchor" key in the API response; it turned out to be re-downloading messages again and again, which I misinterpreted as there just being a lot (millions!) of messages. I was running it for hours, and I just wanted to know when I might expect it to be done!

Once I figured out my bug, the whole thing took under 15 minutes for me for the Zulip server I was fetching from. I still don't have a progress bar with ETA in the gist I linked to, but your ID-space idea would do the trick.

timabbott commented 3 years ago

Cool. We would like the Zulip Python bindings to have a client.get_all_messages(...) method that basically does that loop (taking a narrow as a parameter, which could be the streams:public one that I think your example uses) -- it feels like folks reimplement that a lot. If you want to polish your gist and submit it as a PR for that purpose, we can try to merge it as the official implementation.
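For readers who land here, the loop in question might look roughly like the sketch below. It is not the official implementation, and `fetch_all_messages` is a hypothetical name. It assumes a client whose `get_messages(request)` method takes a dict with `anchor`, `num_before`, `num_after` and `narrow`, and returns a dict with `result`, `messages` and `found_newest`, as described for GET /api/v1/messages. The key point is that the anchor is advanced past the last message received on every request, which avoids re-downloading the same batch repeatedly.

```python
from typing import Any, Dict, Iterator, List


def fetch_all_messages(
    client: Any,
    narrow: List[Dict[str, str]],
    batch_size: int = 1000,  # assumed batch size; the server may cap it
) -> Iterator[Dict[str, Any]]:
    """Yield every message matching `narrow`, oldest first (hypothetical helper)."""
    anchor = 0  # start at the beginning of the ID space
    while True:
        response = client.get_messages(
            {
                "anchor": anchor,
                "num_before": 0,
                "num_after": batch_size,
                "narrow": narrow,
            }
        )
        if response.get("result") != "success":
            raise RuntimeError(response.get("msg", "get_messages failed"))
        messages = response["messages"]
        yield from messages
        # Stop once the server reports the newest message was reached,
        # or when a batch comes back empty.
        if response.get("found_newest") or not messages:
            return
        # Advance past the last message seen so the next batch does not repeat it.
        anchor = messages[-1]["id"] + 1
```

With the real bindings, this would presumably be called with something like `zulip.Client(config_file="~/zuliprc")` as the client and `[{"operator": "streams", "operand": "public"}]` as the narrow.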

timabbott commented 3 years ago

Also worth noting that https://github.com/zulip/zulip-archive has a copy of that loop embedded in it.

dwinston commented 3 years ago

Oh wow, I did not think to poke around for something like the zulip-archive repo. I pretty much stuck with poring over https://zulip.com/api/get-messages. It does seem that I re-implemented a slice of populate.py! I just might attempt such a PR...thanks for the invitation!