walljcg opened 2 weeks ago
In an initial exploration session today I'm encouraged by what I'm seeing with lithops so far. 👍
Using the `lithops.LocalhostExecutor`, I've adapted this official map reduce example into a script (see below) that generates `EcoMap`s in parallel using Python multiprocessing (mediated by lithops).
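For reference, here is a minimal sketch of the shape of that script. All names here (`split_table`, `make_ecomap`, `collect`) are placeholders, not the real ecoscope calls; I'm passing part indices as the iterdata rather than the chunks themselves, which keeps the serialized task payloads small.

```python
"""Hypothetical sketch of mr.py: split a source table into N parts and
render one map per part via lithops map_reduce. Placeholder names only."""
import sys

def split_table(rows, n_parts):
    # Split step: n_parts roughly equal row chunks.
    k, m = divmod(len(rows), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + k + (1 if i < m else 0)
        chunks.append(rows[start:end])
        start = end
    return chunks

def make_ecomap(part_index):
    # Apply step: placeholder. The real version would load chunk
    # `part_index` of the source table and render one EcoMap from it.
    return f"ecomap-{part_index}"

def collect(maps):
    # Reduce step: gather all rendered maps.
    return list(maps)

def main(n_parts):
    import time
    import lithops  # LocalhostExecutor runs map tasks as local processes
    rows = list(range(1000))  # stand-in for the real source table
    print(f"Splitting source table into {n_parts} parts")
    chunks = split_table(rows, n_parts)
    t0 = time.monotonic()
    fexec = lithops.LocalhostExecutor()
    # One map task per part; each worker would look up its own chunk.
    fexec.map_reduce(make_ecomap, list(range(len(chunks))), collect)
    fexec.get_result()
    elapsed = time.monotonic() - t0
    print(f"Elapsed time for generating {n_parts} ecomaps: {elapsed:.0f}s")

if __name__ == "__main__":
    main(int(sys.argv[1]))
```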
Initial results:

```
➜ python3 mr.py 1
Splitting source table into 1 parts
Elapsed time for generating 1 ecomaps: 12s
➜ python3 mr.py 2
Splitting source table into 2 parts
Elapsed time for generating 2 ecomaps: 12s
➜ python3 mr.py 5
Splitting source table into 5 parts
Elapsed time for generating 5 ecomaps: 13s
➜ python3 mr.py 15
Splitting source table into 15 parts
Elapsed time for generating 15 ecomaps: 28s
```
So wall time is basically flat from 1 -> 5 maps, and then once I get to 15, it starts going up, because I've crossed the threshold of the number of cores I have on my laptop (8), and the scheduler can't run everything in parallel.
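That threshold behavior matches a simple "waves" model: with c cores, n maps run in ceil(n/c) sequential waves. A toy calculation (the ~12s per-map figure is taken from the runs above; scheduling overhead is ignored, which is presumably where the 24s-predicted vs 28s-observed gap comes from):

```python
import math

def predicted_wall_time(n_maps, n_cores, per_map_seconds):
    # Maps run in "waves" of up to n_cores at a time; total wall time is
    # (number of waves) x (per-map time), ignoring scheduling overhead.
    waves = math.ceil(n_maps / n_cores)
    return waves * per_map_seconds

# With 8 cores and ~12s per map:
#   1, 2, or 5 maps -> 1 wave  -> ~12s (flat, as observed)
#   15 maps         -> 2 waves -> ~24s (vs. 28s observed; gap is overhead)
```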
Next steps, which I'll work on later this weekend:
- ecoscope-workflows
This looks great Charles and indeed very promising. I'm curious about the `spawn_reducer` parameter - doesn't setting that to zero mean it effectively doesn't wait for any results to be returned before executing the `reduce_function()`?
Good question. I definitely don't 100% grok this option yet, particularly how it behaves in a local context. My basic understanding (which could be wrong) is that this option determines when the container / process that will eventually execute the reduction is started, not when the reduce function is actually called. For sure worth looking into this further, which I will do.
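For the record, my current reading of the lithops docs (again, could be wrong): `spawn_reducer` is the percentage of map activations that must complete before the reducer worker is spawned, so 0 means "spawn it immediately", but the reduce function itself still only runs once all map results are in. A sketch of how I'd poke at it, with the lithops import guarded so the toy map/reduce functions run anywhere:

```python
def count_rows(chunk_size):
    # Toy map function: pretend each task processes chunk_size rows.
    return chunk_size

def total_rows(results):
    # Reduce function: only invoked once every map result is available,
    # regardless of when the reducer worker itself was spawned.
    return sum(results)

def run(spawn_reducer):
    import lithops  # assumes lithops is installed
    fexec = lithops.LocalhostExecutor()
    # spawn_reducer=0 should start the reducer worker immediately;
    # spawn_reducer=100 should wait until all maps finish before spawning it.
    # Either way, get_result() should return the same total.
    fexec.map_reduce(count_rows, [10, 20, 30], total_rows,
                     spawn_reducer=spawn_reducer)
    return fexec.get_result()
```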
We want to be able to perform split-apply-combine operations on (geo)pandas dataframes in a distributed / scalable way, both on a single multi-core machine and in the cloud. It looks like lithops may already take care of the architecture for this (https://github.com/lithops-cloud/lithops), but we need to research how to generalize the use of lithops across tasks that may need the split/apply/combine paradigm, both locally and in the cloud.
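One encouraging property: if lithops works as advertised, the local/cloud split should mostly reduce to which executor we construct, with the map/reduce code itself unchanged. A hypothetical sketch (`LocalhostExecutor` and `FunctionExecutor` are real lithops classes, the latter reading the configured cloud backend from the lithops config; the `pick_backend` heuristic is made up purely for illustration):

```python
def pick_backend(n_chunks, local_cores):
    # Made-up heuristic: stay local while all chunks fit in one wave
    # of local cores; otherwise fan out to the configured cloud backend.
    return "localhost" if n_chunks <= local_cores else "cloud"

def get_executor(backend):
    import lithops  # imported here so pick_backend is testable without lithops
    if backend == "localhost":
        return lithops.LocalhostExecutor()
    # FunctionExecutor picks up the configured cloud backend
    # (e.g. AWS Lambda, GCP Functions) from the lithops config file.
    return lithops.FunctionExecutor()
```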