zazwaz12 / CITS3200---National-Housing-Simulation

National Housing Simulation - mapping data points from the G-NAF and the census data sets.
Parallelise file-loading #74

Closed SodaVolcano closed 1 month ago

SodaVolcano commented 2 months ago

Feature Summary

Currently, converting GNAF data to a GeoDataFrame with to_geo_dataframe and loading the shapefile with read_shapefile each take a significant amount of time.

Instead of reading them one by one, modify join_structures_with_shapefile_areas.py to read them in parallel and wait for both files to finish loading before proceeding. Do the same for the overall pipeline once we have one.

More specifically, create a general function that takes any number of "jobs" to execute in parallel, and place it in nhs.utils.parallel.py:

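from typing import Any, Callable

def compute_parallel(*jobs: Callable[[], Any]):
    # Each job is a zero-argument callable (e.g. a lambda wrapping a loader).
    pass

compute_parallel(lambda: read_csv("./some.csv"), lambda: to_geo_dataframe(...), ...)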
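
For illustration only, here is one way the stub above could be filled in. It is a minimal sketch, not the project's implementation: it assumes a thread pool from the standard library's concurrent.futures is acceptable (lambdas cannot be pickled for a process pool), and that results should come back in the same order as the jobs. The slow_load helper is a hypothetical stand-in for read_shapefile and to_geo_dataframe.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List


def compute_parallel(*jobs: Callable[[], Any]) -> List[Any]:
    """Run zero-argument jobs concurrently; return results in argument order."""
    if not jobs:
        return []
    # One worker per job so that every job starts immediately.
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(job) for job in jobs]
        # result() blocks until that job is done and re-raises its exception,
        # so this line only returns once all jobs have finished.
        return [future.result() for future in futures]


def slow_load(name: str, seconds: float) -> str:
    # Hypothetical stand-in for a slow loader such as read_shapefile.
    time.sleep(seconds)
    return name


if __name__ == "__main__":
    gnaf, areas = compute_parallel(
        lambda: slow_load("gnaf", 0.2),
        lambda: slow_load("areas", 0.1),
    )
    print(gnaf, areas)
```

Threads help most when the jobs spend their time on file I/O or inside libraries that release the GIL. If the conversion is dominated by pure-Python work, the speed-up may be small, and a process pool with picklable top-level functions would be the alternative to try.
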
SodaVolcano commented 1 month ago

Not needed, I've dumped the joined GNAF dataset to Parquet instead.