wildlife-dynamics / ecoscope

Conservation data analytics
https://ecoscope.io
BSD 3-Clause "New" or "Revised" License

POC: use async client to download from ER #195

Open walljcg opened 1 week ago

walljcg commented 1 week ago

We have been using the er_client.get_objects_multithreaded() function to download events and observations. If we switch from multithreading to async, we have the potential to greatly reduce download time.
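For anyone unfamiliar with the pattern, here is a minimal sketch of what concurrent paginated downloads look like with asyncio. It does not use the ER client at all: fetch_page is a hypothetical stand-in that simulates one request with asyncio.sleep, and the page size, latency and concurrency limit are made-up values.

```python
import asyncio


async def fetch_page(page: int, latency: float = 0.05) -> list[int]:
    # Hypothetical stand-in for one HTTP request; a real client would await
    # a network call here instead of sleeping.
    await asyncio.sleep(latency)
    return [page * 10 + i for i in range(10)]


async def fetch_all(n_pages: int, max_concurrency: int = 5) -> list[int]:
    # The semaphore caps the number of requests in flight at any one time.
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(page: int) -> list[int]:
        async with sem:
            return await fetch_page(page)

    # gather() returns results in the order the awaitables were passed in,
    # regardless of which request finishes first.
    pages = await asyncio.gather(*(bounded(p) for p in range(n_pages)))
    return [item for page in pages for item in page]


def download(n_pages: int = 20) -> list[int]:
    # Synchronous entry point so callers do not need their own event loop.
    return asyncio.run(fetch_all(n_pages))


if __name__ == "__main__":
    print(len(download(20)))
```

All requests share a single thread and yield while waiting on I/O, which is where any gain over a thread pool would come from on a high-latency link.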

atmorling commented 5 days ago

Have linked WIP PR #198, which has a rough POC for the get_patrol_observations flow.

Using the test_asyncer.py script included in the PR to compare the two clients, get_patrol_observations over the MEP-DEV dataset (12 patrols, ~500 observations) goes from ~18s to ~11s. However, this isn't a great test, as the observations are all within a single patrol.

Hooking it up to MEP (prod), for ~30k observations (the last few days of patrol data), the total time drops from ~152s to ~64s. (Tests were done on a high-latency connection.)
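To put those timings on a common scale, a quick sketch of the arithmetic. The inputs are the approximate figures quoted above, so the ratios are rough as well; the speedup and time_saved helpers are just illustrative names.

```python
def speedup(before_s: float, after_s: float) -> float:
    # How many times faster the new run is than the old one.
    return before_s / after_s


def time_saved(before_s: float, after_s: float) -> float:
    # Fraction of the original wall-clock time that was eliminated.
    return 1 - after_s / before_s


# Approximate timings in seconds: (multithreaded, async).
timings = {
    "MEP-DEV (~500 observations)": (18, 11),
    "MEP prod (~30k observations)": (152, 64),
}

for name, (before, after) in timings.items():
    print(f"{name}: {speedup(before, after):.2f}x faster, "
          f"{time_saved(before, after):.0%} less time")
```

On these numbers the gain is noticeably larger on the bigger prod download than on MEP-DEV.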

That's probably enough of an improvement to take this further. Keen to get eyes on the flow as it stands, in case others have ideas I haven't considered.

atmorling commented 5 days ago

I noted some slight differences in observation counts for certain subjects (in all cases, the async version fetches more observations). I've investigated this to the point that I'm confident the difference stems from get_objects_multithreaded and not ecoscope code.
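For reference, a minimal sketch of one way to compare per-subject counts between the two result sets. This is not the check used in the investigation: the record shape (dicts with a subject_id key) and the sample data are hypothetical, purely to illustrate the comparison.

```python
from collections import Counter


def count_by_subject(observations):
    # Tally observations per subject; assumes each record has a "subject_id" key.
    return Counter(obs["subject_id"] for obs in observations)


def count_differences(threaded, asynced):
    # Map each subject whose counts disagree to (async count - threaded count).
    a = count_by_subject(threaded)
    b = count_by_subject(asynced)
    return {s: b[s] - a[s] for s in sorted(set(a) | set(b)) if a[s] != b[s]}


# Hypothetical sample data: the async result has one extra record for subject-b.
threaded_obs = [
    {"subject_id": "subject-a", "id": 1},
    {"subject_id": "subject-b", "id": 2},
]
async_obs = threaded_obs + [{"subject_id": "subject-b", "id": 3}]

print(count_differences(threaded_obs, async_obs))
```

A positive value means the async client returned more observations for that subject than the multithreaded one.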