cisaacstern opened 1 week ago
@walljcg here's a start on how we can leverage lithops map reduce (as demonstrated in https://github.com/wildlife-dynamics/ecoscope-workflows/issues/28#issuecomment-2183694749) in the context of ecoscope workflows/tasks. The core task is here:
And here is an example of what it might look like compiled into a runnable workflow:
💡 Note that this does not yet run, but is more so one step above pseudocode. Among other things, we don't actually need to split-apply-combine for time density maps in this way; I'm just using that as a toy example here.
In terms of where a script like this could run: locally, it can parallelize across Python processes. In the cloud, we could package this script itself as a "launcher" serverless function on GCP Cloud Run, which, once it gets to the lithops map-reduce section, would spawn additional Cloud Run functions. The "launcher" function (running this script) would then wait for all of the parallelized tasks to complete, gather the results, and send them wherever we need them (maybe a database API call to the server).
Note also that this pattern (launch lithops from another cloud function) is conceptually almost identical to the Lithops Airflow Operator. (Which we may also want to use in the future for more complex DAGs, but "deploy lithops from a cloud function" is definitely lower latency for "easy/small" DAGs, as we've discussed.)
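To make the launcher pattern concrete, here is a minimal, stdlib-only sketch of the shape of it. A thread pool stands in for the parallel workers here just so the example is self-contained; with lithops, the executor would instead fan out to processes locally or serverless functions (e.g. Cloud Run) in the cloud. `draw_map_widget` and `gather_dashboard` are placeholder names, not actual ecoscope-workflows tasks.

```python
# Sketch of the "launcher" pattern: fan out a map step, wait for all
# parallel tasks to complete, then gather the result (the reduce step).
from concurrent.futures import ThreadPoolExecutor

def draw_map_widget(group: str) -> dict:
    # map step: turn a single group into a map widget (placeholder)
    return {"group": group, "widget": f"time-density-map-for-{group}"}

def gather_dashboard(widgets: list) -> dict:
    # reduce step: combine the per-group widgets into one dashboard
    return {"dashboard": [w["widget"] for w in widgets]}

def launcher(groups: list) -> dict:
    # the "launcher" blocks until every parallel task has completed
    with ThreadPoolExecutor() as pool:
        widgets = list(pool.map(draw_map_widget, groups))
    return gather_dashboard(widgets)

print(launcher(["elephants", "giraffes"]))
# → {'dashboard': ['time-density-map-for-elephants', 'time-density-map-for-giraffes']}
```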
I've also iterated a bit on the workflow YAML spec design, so we can represent the map reduce operation there something like this:
To make this more legible, I'm experimenting with borrowing the GitHub Actions data structure of:
```yaml
- name: "a human readable name"
  id: |
    # a unique id. we were using the task names for this before,
    # but to support reuse of the same task within a workflow, we'll need an `id`
  uses: # the action name in github, or for us, the task importable reference
  from: |
    # i'm not actually using this here, and i don't think it's part of GitHub Actions,
    # but i thought this might be a nice way to include the github or pypi path for
    # extension task packages, which could be dynamically installed in a venv at compile
    # time (like pre-commit). https://github.com/moradology/venvception is a nice little
    # package that does this (ephemeral venvs), that a former collaborator wrote for our
    # last project.
  with:
    # kwargs to pass to the task
```
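Putting those fields together, a filled-in step might look something like the following. (This is purely illustrative: the `uses` path and the kwargs are made-up placeholders, not real ecoscope-workflows task references.)

```yaml
- name: "Calculate time density for one group"
  id: calculate_time_density_0  # unique per use, so the same task can appear twice
  uses: ecoscope_workflows.tasks.analysis.calculate_time_density  # hypothetical path
  with:
    # placeholder kwargs for illustration only
    some_kwarg: some_value
```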
Based on the current working example here, some observations/thoughts re: performance and optimization (all of the below is based on local testing; I have not run on Cloud Run yet):

- `map_function` (turns a single group into a map widget), without any lithops overhead, takes about 10 seconds.
- lithops adds about 10 seconds of overhead, for a total 20 seconds runtime (with no attempts at optimization yet).
- About 4 whole seconds are for calling `calculate_time_density` itself (I have not profiled further within that call yet).
- Importing `ecoscope` core modules can be costly. If any tasks use the same `ecoscope` sub-packages, then sharing an import cache could be meaningful (I don't know if the namespace separation of just importing in function scope prevents this, but I think it may?)... but if there is no module overlap, this may not matter.
- Reducing the import cost of `ecoscope` could have a meaningful impact.
- So could optimizing the tasks (e.g. `calculate_time_density`) themselves.

Next steps:

- `parallel_collection` for mapping all widgets simultaneously?
- `gather_dashboard`
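On the import-cache question above: within a single Python process, function-scope imports do share the interpreter's module cache (`sys.modules`), so only the first import pays the loading cost; the function-local name binding doesn't prevent sharing. Separate lithops workers are separate processes, though, so each worker pays the cost once regardless. A quick stdlib demonstration (using `json` as a stand-in for an `ecoscope` sub-package):

```python
import sys

def task_a():
    import json  # function-scope import; cached in sys.modules after first load
    return json.dumps({"task": "a"})

def task_b():
    import json  # re-uses the exact same module object as task_a's import
    return json.loads(task_a())["task"]

task_a()
# the module is cached process-wide despite the function-local binding
assert "json" in sys.modules
import json as top_level_json
assert sys.modules["json"] is top_level_json  # one shared module object
print(task_b())  # → a
```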
> - About 4 whole seconds are for calling `calculate_time_density` itself (I have not profiled further within that call yet)
Just throwing this up here as a data point:
this loop and the calls to `intersect1d` are the bulk of the time:
https://github.com/wildlife-dynamics/ecoscope/blob/6e7dd1ea782c2eef07d7bf794be76e7d4a077042/ecoscope/analysis/UD/etd_range.py#L191-L198
Wow @atmorling this is a heroic insight!! If the cost is largely numpy, then it should be optimizable!
Can you share how you generated this profile image? I am still a n00b when it comes to this stuff
This is just the result of running `cProfile` on a test script, visualized via SnakeViz; specifically:

```shell
python -m cProfile -o etd.prof test_etd.py
python -m snakeviz etd.prof
```
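For reference, the same stats file can also be inspected without SnakeViz, using the stdlib `pstats` module. (`slow_function` below is just a stand-in workload for the real `test_etd.py` script.)

```python
import cProfile
import io
import pstats

def slow_function():
    # stand-in workload for the real test script
    return sum(i * i for i in range(100_000))

# equivalent to `python -m cProfile -o etd.prof test_etd.py`
with cProfile.Profile() as prof:
    slow_function()
prof.dump_stats("etd.prof")

# print the top entries by cumulative time, roughly what SnakeViz visualizes
stream = io.StringIO()
pstats.Stats("etd.prof", stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```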
I took another look at this and there's a sort inside `intersect1d`, and we call `intersect1d` a lot (see `ncalls` in the below image). (This is from the same visualization as above.)
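Since `np.intersect1d` sorts (and dedupes) both inputs on every call, a loop that repeatedly intersects against the same array keeps re-sorting it. One possible micro-optimization, sketched below under assumptions (this is not what the jit-compilation PR attempts, and it is unverified against the real `etd_range` loop): hoist the sort/dedupe out with `np.unique` and use an `np.isin` membership mask per iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.integers(0, 1000, size=500)
others = [rng.integers(0, 1000, size=500) for _ in range(10)]

# baseline: np.intersect1d re-sorts `base` on every loop iteration
baseline = [np.intersect1d(base, other) for other in others]

# alternative: sort/dedupe `base` once, then mask per iteration
base_unique = np.unique(base)  # sorted + deduped once, up front
hoisted = [base_unique[np.isin(base_unique, other)] for other in others]

# both approaches produce the same (sorted, unique) intersections
assert all(np.array_equal(a, b) for a, b in zip(baseline, hoisted))
```

Whether this actually wins in practice would depend on the array sizes and dtypes in the real loop, so it would need profiling like the above.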
Took a stab at jit-compiling that loop, so far to no avail 😄 https://github.com/wildlife-dynamics/ecoscope/pull/193
A few quick notes on build / deploy / call commands:
Build the runtime (container) with:
```shell
# run from the `lithops` directory, on branch split-apply-combine
LITHOPS_CONFIG_FILE=../../.ecoscope-workflows-tmp/.lithops_config \
    lithops runtime build -b gcp_cloudrun -f Dockerfile.cloudrun ecoscope-workflows-runtime
```
Deploy it with:
```shell
# run from the `lithops` directory, on branch split-apply-combine
LITHOPS_CONFIG_FILE=../../.ecoscope-workflows-tmp/.lithops_config \
    lithops runtime deploy -b gcp_cloudrun ecoscope-workflows-runtime
```
Call the script with:
```shell
# run from the repo root (`ecoscope-workflows`), on branch split-apply-combine
LITHOPS_CONFIG_FILE=.ecoscope-workflows-tmp/.lithops_config \
    ECOSCOPE_WORKFLOWS_TMP=gs://ecoscope-workflows-tmp \
    python3 examples/dags/time_density_map_reduce.script_lithops.py
```
All with config:
```yaml
# .ecoscope-workflows-tmp/.lithops_config
lithops:
  backend: gcp_cloudrun
  storage: gcp_storage

gcp:
  region: us-central1
  credentials_path: <path to service account json>

gcp_cloudrun:
  runtime: us.gcr.io/ecoscope-poc/ecoscope-workflows-runtime
  runtime_cpu: 2
  runtime_memory: 1000
```
Closes #28