zeehio / parmap

Easy-to-use map and starmap Python equivalents
Apache License 2.0

Are you a parmap user? Please enter #11

Open zeehio opened 7 years ago

zeehio commented 7 years ago

Hi,

I am curious to know who is using parmap and for what purpose. Sometimes I believe there are no users out there, and then I feel happy when someone pops by and opens an issue. If you are using parmap and want to leave a note, please do so here; I would be very happy to know what parmap is being used for. Once you have answered, feel free to click "Unsubscribe" on the right if you don't want to receive further notifications from other parmap users.

For instance, here is one user who wrote to me about his paper on spinning black hole binaries, where he had used parmap:

Thanks!

mon commented 7 years ago

I actually found parmap on Stack Overflow whilst looking for a nice, py2+py3 way to provide constant variables to map. Finding that it supported tqdm was very pleasant. I'm using it to help me process about 300 GB of seismic data, which I hand off to parmap for the analysis calculations. Thanks for the useful library!

saddy001 commented 6 years ago

I'm using it for custom scikit-learn estimators.

saddy001 commented 6 years ago

You could attract potential users by adding parmap as an answer to related questions on Stack Overflow (e.g. https://stackoverflow.com/q/9911819). Indeed, it was the best solution of those I tested -- but you should state that you're the author.

zeehio commented 6 years ago

Thanks for the tip. I am not actively searching for more users, though. It's great if they find parmap and like it, and I will talk about parmap to anyone who might be interested. However, I can't spend time finding users who might like parmap right now, and if those users came, I would need to spend even more time fixing issues.

So, when I have the time, I will start actively looking for more users. Until then, they will have to find parmap on their own. Feel free to tell others about parmap if you want, though.

Strizzi12 commented 6 years ago

I am currently using parmap for my master's thesis on emotion detection in tweets.

acere commented 5 years ago

Just found parmap and loving it; it saved me a lot of `partial` and `Pool` calls! As for the application: signal analysis for single-photon detectors.

zhenglilei commented 5 years ago

One line of code for parallel computation, with a progress bar. I love this tiny tool very much and use it everywhere I need parallelization.

gryBox commented 4 years ago

Hi - I am using parmap for generating nodes in knowledge graphs. A couple of questions:

  1. If `pm_processes` is not passed, does the number of processes scale to the maximum available?
  2. If each item in the list spawns a long process, is chunking a good way to speed things up further?
zeehio commented 4 years ago

@gryBox

Empty pm_processes

If `pm_processes` is not passed, parmap follows the `multiprocessing.Pool` default and therefore uses `os.cpu_count()` processes.
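
For illustration, a minimal sketch (the `square` worker is made up; `pm_processes` is parmap's keyword for the pool size):

```python
import parmap

def square(x):
    # Trivial stand-in for real work.
    return x ** 2

# Without pm_processes, parmap creates a multiprocessing.Pool()
# whose default size is os.cpu_count().
results = parmap.map(square, range(100))

# With pm_processes, the pool is capped at the given number of workers.
results = parmap.map(square, range(100), pm_processes=4)
```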

About chunksize values

By default, the chunksize is `len(iterable) / (4 * pm_processes)`, rounded up if necessary. This is also the default in multiprocessing. If you have 200 tasks and 5 parallel processes, chunksize = 200 / (4 * 5) = 10.
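
For reference, a minimal sketch of that heuristic (`default_chunksize` is a hypothetical helper name; the arithmetic mirrors the multiprocessing.Pool default described above):

```python
import math

def default_chunksize(num_tasks, num_processes):
    # Split the work into roughly 4 chunks per worker, rounding up,
    # as multiprocessing.Pool.map does by default.
    return math.ceil(num_tasks / (4 * num_processes))

print(default_chunksize(200, 5))  # 10
```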

I will try to explain why that default is reasonable by going to the extremes:

chunksize = 1

Using a chunksize of 1 means that each task is submitted individually: as soon as one task finishes, the main process submits another one. This would be fine if submitting a task had no overhead, which is not the case. If each task takes only a short time to finish, such a small chunksize means that multiprocessing spends a comparatively large amount of time submitting data and collecting results. In this case, parallelizing with chunksize = 1 could make the code run slower.

chunksize = number of tasks

If you just create one big chunk, it can only be sent to one process, so you can't parallelize at all. It is an absurdly high value; instead of using it, simply disable parallelization.
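
As a sketch, disabling parallelization in parmap is a single keyword (`pm_parallel=False` falls back to a sequential map; the worker here is made up):

```python
import parmap

def long_task(x):
    # Stand-in for an expensive computation.
    return x * 2

# Run sequentially in the current process: no pool, no chunking.
results = parmap.map(long_task, range(10), pm_parallel=False)
```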

chunksize = num_tasks/num_processes

You split your tasks into as many groups as there are parallel processes. You minimize the number of submissions, so the overhead is minimal. This may seem like a very smart approach, but what happens if the tasks take different amounts of time to complete? With this approach, if you are unlucky, one of your processes may get one or several long tasks, and when the other processes have finished, you will still be waiting for that one process to work through multiple tasks. All tasks have already been submitted, so the idle processes can't help the one that was given too much work.

chunksize = num_tasks/(4*num_processes)

This is a reasonable tradeoff. Each process gets, on average, four submissions of tasks. If one task is much longer than the rest, the process stuck with it will probably end up taking only two or three submissions, while the other processes take five each. The overhead is a little larger, but the benefit in the general case is much greater.

chunksize summary

In summary, the default is usually good enough. If you have a huge number of equally short tasks, a larger chunksize might be significantly beneficial. I haven't done any formal benchmark; feel free to do one if you want.
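
If you do want to experiment, here is a hedged sketch of overriding the default (`pm_chunksize` is parmap's keyword for this; the workload and numbers are made up):

```python
import parmap

def tiny_task(x):
    # Very short task: submission overhead dominates the work itself.
    return x + 1

# A larger chunk amortizes the submission overhead across more tasks
# per round trip between the main process and the workers.
results = parmap.map(tiny_task, range(1_000_000), pm_chunksize=10_000)
```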

gryBox commented 4 years ago

@zeehio Thank you, that is a clear and easy explanation. Leaving things at the default for now. Wonderful tool!

lewismc commented 2 years ago

tagbase-server uses parmap to asynchronously process biologging data from electronic tags deployed on various marine animals. This is an excellent utility library. Thank you @zeehio 👍

XChikuX commented 2 years ago

@zeehio I used parmap to target 24 million GitHub repos for their language dependency files a few years ago, as part of some security analysis I was doing during my Master's. Very glad this tool existed, especially since I didn't want to move to a compiled language for multiprocessing work.