zeek / pysubnettree

A Python Module for CIDR Lookups

Very high memory usage with large sets #26

Open e271828- opened 3 years ago

e271828- commented 3 years ago

Hello, while importing a 115 MB list of CIDRs into a pysubnettree set, I noticed that the resulting in-memory structure was ~10x the size of the original list of strings.

This is counterintuitive: I'd have expected it to be radically smaller. What am I missing?
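A minimal sketch of the kind of load and measurement being described, for context. The dictionary-style insertion is pysubnettree's documented interface; the file name `cidrs.txt` is hypothetical, and the RSS reading via `resource` is only an approximation, since `ru_maxrss` reports the peak (in kilobytes on Linux), not the current, resident set size.

```python
# Rough reproduction: load one CIDR per line and compare peak RSS
# before and after the build. "cidrs.txt" is a hypothetical input file.
import resource

import SubnetTree

def peak_rss_kb() -> int:
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

before = peak_rss_kb()
tree = SubnetTree.SubnetTree()
with open("cidrs.txt") as f:
    for line in f:
        tree[line.strip()] = True  # dict-style insert: key is "a.b.c.d/mask"
after = peak_rss_kb()
print(f"approximate tree footprint: {(after - before) / 1024:.1f} MiB")
```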

rsmmr commented 3 years ago

Well, I'd expect some overhead for sure: it's putting all the items into a tree structure that needs management. I'm honestly not sure how much overhead I would expect, but I can see it being substantial, given that the original string representation is about the most compact form possible. Either way, if the overhead is really just coming out of the underlying Patricia tree, I'm afraid there's not much we can do about it. It could instead be a memory leak, which would obviously be a bug; do you know if everything gets cleaned up OK when the data structure is destroyed?
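One rough way to answer the cleanup question, as a sketch: build and destroy the tree in a loop and watch peak RSS across iterations. The synthetic prefix pattern and iteration counts below are made up for illustration; if the destructor frees everything, the peak should plateau after the first iteration, while a leak shows up as steady growth.

```python
# Crude leak check: peak RSS can only grow, so if memory is freed and
# reused by the allocator, it stops growing after iteration 0.
import gc
import resource

import SubnetTree

def peak_rss_kb() -> int:
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

for i in range(5):
    tree = SubnetTree.SubnetTree()
    for n in range(50_000):
        # 50,000 distinct synthetic /24 prefixes inside 10.0.0.0/8
        tree[f"10.{(n >> 8) & 0xFF}.{n & 0xFF}.0/24"] = n
    del tree
    gc.collect()
    print(f"iteration {i}: peak RSS {peak_rss_kb()} kB")
```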

e271828- commented 3 years ago

Found another solution, so I can't report back on cleanup. However, I'm not sure I agree with you on the string representation: an IPv4 address is a 32-bit number, while the string '123.123.123.123' is 15 characters * 8 = 120 bits. There were also likely some adjacent ranges in my large set. So even with 2x overhead I would have expected a final memory size of about half the input, not 10x.
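The size arithmetic is easy to check directly with the standard library (purely an illustration of the two representations, not pysubnettree's internals):

```python
import ipaddress

addr = ipaddress.IPv4Address("123.123.123.123")
print(len(addr.packed))  # 4 bytes (32 bits) as packed binary
print(len(str(addr)))    # 15 bytes (120 bits) as a dotted-quad string
```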

I was also assuming that space-saving merging of adjacent prefixes occurred as a side effect of the trie at add time, but that does not appear to be the case. Probably something like an OET would be a better fit for larger lists.
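For what it's worth, that kind of adjacency merging can be done ahead of insertion with the standard library's `ipaddress.collapse_addresses()`, which coalesces adjacent and overlapping networks. A sketch, separate from pysubnettree itself:

```python
# collapse_addresses() merges adjacent/overlapping networks, the
# aggregation the trie apparently does not perform at add time.
import ipaddress

nets = [ipaddress.ip_network(c) for c in
        ["10.0.0.0/25", "10.0.0.128/25", "10.0.1.0/24"]]
merged = list(ipaddress.collapse_addresses(nets))
print(merged)  # [IPv4Network('10.0.0.0/23')]
```

Pre-collapsing a large list this way should shrink both the node count and the memory footprint before the prefixes ever reach the trie.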