Currently, converting GNAF data to a GeoDataFrame in `to_geo_dataframe` and `read_shapefile` takes a significant amount of time. Instead of reading them one by one, modify `join_structures_with_shapefile_areas.py` to read them in parallel and wait for both files to finish loading before proceeding (and do the same for the overall pipeline once we have that).

More specifically, create a general function that takes in any number of "jobs" to execute in parallel, and place it in `nhs.utils.parallel.py`. `*jobs` is a LIST of FUNCTIONS to execute - each function should be very time intensive, e.g. reading a whole spreadsheet. Example usage would be...
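...something like the sketch below (assuming `compute_parallel(*jobs)` as described further down, and that `read_csv` / `to_geo_dataframe` are the existing loaders):

```python
from nhs.utils.parallel import compute_parallel

# Wrap each expensive call in a lambda so nothing runs yet;
# compute_parallel will invoke the lambdas in parallel.
houses, areas = compute_parallel(
    lambda: read_csv("./some.csv"),
    lambda: to_geo_dataframe("../shapefile/"),
)
```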
Notice how, if we want to run `read_csv("./some.csv")`, we WRAP IT IN a `lambda` - this prevents `read_csv` from executing immediately, so we can let `compute_parallel` execute it.

Give it a try - try running `lambda: some_function(...)` and then running `some_function(...)`, and see what the difference is. For the `lambda` case, what can you do with this object to run the function?
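For example, with a toy `some_function` (the exact repr will vary):

```python
>>> def some_function(x):
...     return x * 2
...
>>> some_function(21)                # executes right away
42
>>> job = lambda: some_function(21)  # nothing executes yet
>>> job                              # it's just a callable object...
<function <lambda> at 0x7f...>
>>> job()                            # ...calling it runs some_function now
42
```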
`compute_parallel` should, given a list `jobs`, RUN each function in `jobs` in parallel, wait for them all to finish, and return a list of the return values from each job.

So if I pass in `compute_parallel(lambda: read_csv("./some.csv"), lambda: to_geo_dataframe("../shapefile/"))`, it will run both jobs in parallel, wait for them to finish, and return `(<DataFrame>, <GeoDataFrame>)`.
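One possible sketch (an assumption, not the required implementation - `concurrent.futures.ThreadPoolExecutor` suits I/O-bound jobs like file reads; a process pool is another option):

```python
# nhs/utils/parallel.py (sketch)
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def compute_parallel(*jobs: Callable[[], Any]) -> list[Any]:
    """Run each zero-argument job in parallel and wait for all of them.

    Returns the jobs' results in the order the jobs were passed in.
    """
    # One worker per job so every job starts immediately
    # (fall back to the executor's default if no jobs are given).
    with ThreadPoolExecutor(max_workers=len(jobs) or None) as executor:
        futures = [executor.submit(job) for job in jobs]
        # result() blocks until that job finishes (or re-raises its error)
        return [future.result() for future in futures]
```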