Currently, converting GNAF data to a GeoDataFrame in `to_geo_dataframe` and `read_shapefile` takes a significant amount of time. Instead of reading them one by one, modify `join_structures_with_shapefile_areas.py` to read them in parallel and wait for both files to finish loading before proceeding (and do the same for the overall pipeline once we have that).

More specifically, create a general function that takes in any number of "jobs" to execute in parallel, and place it in `nhs.utils.parallel.py`. `*jobs` is a LIST of FUNCTIONS to execute - each function should be very time intensive, e.g. reading a whole spreadsheet. Example usage would be...
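...something like the sketch below (assuming `compute_parallel(*jobs)` as described further down, and that `read_csv` / `to_geo_dataframe` are the existing loaders):

```python
from nhs.utils.parallel import compute_parallel

# Wrap each expensive call in a lambda so nothing runs yet;
# compute_parallel will invoke the lambdas in parallel.
houses, areas = compute_parallel(
    lambda: read_csv("./some.csv"),
    lambda: to_geo_dataframe("../shapefile/"),
)
```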
Notice how, if we want to run `read_csv("./some.csv")`, we WRAP IT IN a `lambda` - this prevents `read_csv` from executing immediately, so we can let `compute_parallel` execute it.

Give it a try - try running `lambda: some_function(...)` and then running `some_function(...)`, and see what the difference is. For the `lambda` case, what can you do with this object to run the function?
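For example, with a toy `some_function` (the exact repr will vary):

```python
>>> def some_function(x):
...     return x * 2
...
>>> some_function(21)                # executes right away
42
>>> job = lambda: some_function(21)  # nothing executes yet
>>> job                              # it's just a callable object...
<function <lambda> at 0x7f...>
>>> job()                            # ...calling it runs some_function now
42
```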
`compute_parallel` should, given a list `jobs`, RUN each function in `jobs` in parallel, wait for them all to finish, and return a list of the return values from each job.

So if I pass in `compute_parallel(lambda: read_csv("./some.csv"), lambda: to_geo_dataframe("../shapefile/"))`, it will run both jobs in parallel, wait for them to finish, and return `(<DataFrame>, <GeoDataFrame>)`.
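One possible sketch (an assumption, not the required implementation - `concurrent.futures.ThreadPoolExecutor` suits I/O-bound jobs like file reads; a process pool is another option):

```python
# nhs/utils/parallel.py (sketch)
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable


def compute_parallel(*jobs: Callable[[], Any]) -> list[Any]:
    """Run each zero-argument job in parallel and wait for all of them.

    Returns the jobs' results in the order the jobs were passed in.
    """
    # One worker per job so every job starts immediately
    # (fall back to the executor's default if no jobs are given).
    with ThreadPoolExecutor(max_workers=len(jobs) or None) as executor:
        futures = [executor.submit(job) for job in jobs]
        # result() blocks until that job finishes (or re-raises its error)
        return [future.result() for future in futures]
```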