zazwaz12 / CITS3200---National-Housing-Simulation

National Housing Simulation - mapping data points from the G-NAF and the census data sets.

Allocate individuals to buildings :skull: #28

Closed SodaVolcano closed 1 month ago

SodaVolcano commented 2 months ago

Write a function that takes two LazyFrames: one for the census data with the columns filtered down, and one for the GNAF data containing the buildings we want to allocate people to. The function outputs a LazyFrame of building rows, with columns for the SA1 area code, the building's longitude and latitude, the number of people allocated to it, and the census columns showing how many people of each demographic are allocated to it (e.g. ["SA1_CODE_2021", "longitude", "latitude", "n_residents", "Christianity_Anglican_F", "Christianity_Anglican_M", "Buddism_F", "Buddism_M", ...]).

NOTE: Each column in the census data holds the number of individuals with that attribute. If a total column is included in the census LazyFrame, write a separate function to check for and remove that column.

NOTE 2: The function must be able to allocate multiple people to one building, since the number of buildings will likely be less than the number of people to allocate.

ctmes commented 2 months ago

Quickly, n_residents may be done by:

ratio = number of people / number of houses
x = round_down(ratio)
y = number of people % number of houses

The first y houses will hold x+1 people, while the rest hold x.

E.g. 6 / 4: ratio = 6/4 = 1.5, x = 1, y = 6 % 4 = 2

The first 2 houses hold 2 people, the rest hold 1. So the allocation will be: 2, 2, 1, 1 (total 6)

E.g. 15 / 4: ratio = 15/4 = 3.75, x = 3, y = 15 % 4 = 3

The first 3 houses hold 4 people, the rest hold 3. So the allocation will be: 4, 4, 4, 3 (total 15)

And then slice the frame to assign the counts, rather than using some sort of naive iterative approach
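The arithmetic above can be sketched in plain Python as follows. This is only an illustration of the even-split idea, not code from the repository; the function name `split_residents` and its parameters are hypothetical.

```python
def split_residents(n_people: int, n_houses: int) -> list[int]:
    """Return the number of residents per house, in house order.

    The first `y` houses receive `x + 1` people and the rest receive `x`,
    where x = n_people // n_houses and y = n_people % n_houses.
    """
    if n_houses <= 0:
        raise ValueError("n_houses must be positive")
    if n_people < 0:
        raise ValueError("n_people must be non-negative")

    # divmod gives the floor of the ratio (x) and the remainder (y) in one step.
    x, y = divmod(n_people, n_houses)
    return [x + 1] * y + [x] * (n_houses - y)


if __name__ == "__main__":
    print(split_residents(6, 4))   # the 6 / 4 example from the comment
    print(split_residents(15, 4))  # the 15 / 4 example from the comment
```

Because the result is built from two list repetitions rather than a per-person loop, the cost depends on the number of houses only, not on the number of people.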

SodaVolcano commented 2 months ago

I don't think the allocation would be random. In the example, the first house is guaranteed to have 1 more person living in it, so the order of the houses matters. If we have 3 people and 6 houses, then the last 3 houses are guaranteed to not have anyone allocated to them. Lazy evaluation may make it difficult to select a random house, since by design it doesn't give you access to the full list at once.

ctmes commented 2 months ago

Oh yeah, the first n houses would be more populous, and depending on how these houses are ordered (maybe north west to south east), those houses would tend to have more. We could use random.shuffle() to shuffle the list, or whatever mutable sequence the data ends up in?

SodaVolcano commented 2 months ago

There's polars.DataFrame.sample, which can shuffle the rows, but this means we'll need to evaluate the LazyFrame to a DataFrame object. Hopefully after conversion to parquet the DataFrame won't take too long to process. Alternatively, if the number of rows is easy to calculate, we can construct a new DataFrame that assigns people to the row numbers and join it with the LazyFrame. If this is the final step in the pipeline, then evaluating to a DataFrame shouldn't be an issue, since we'll need to go through all rows anyway.

ctmes commented 2 months ago

I think calculating the number of rows should be fairly quick and easy, based on my experience with the datasets so far just using pandas. But true, we should try to do all the O(n) stuff at once