yl2chen / cidranger

Fast IP to CIDR lookup in Golang
MIT License
899 stars 105 forks source link

Slow performance populating tree #48

Open phemmer opened 1 year ago

phemmer commented 1 year ago

I was looking at various modules that can provide a fast lookup for whether an IP is in a list, and while testing this module, it turned out to have significantly slower load time compared to the other alternatives I looked at. The lookup is indeed faster, but not enough to justify the increased load time, which for me is prohibitively slow (20x slower than the fastest alternative, and takes over a minute to load a list of about 5 million IPs). I just wanted to create this issue so you're aware, and in case there's something which can be done to increase performance.

Benchmark code ``` package main import ( "encoding/binary" "math/rand" "net" "strconv" "testing" "github.com/asergeyev/nradix" "github.com/infobloxopen/go-trees/iptree" "github.com/yl2chen/cidranger" ) var rng = rand.New(rand.NewSource(0)) func randIP(bits int) net.IP { var ipa [4]byte ip := net.IP(ipa[:]) binary.BigEndian.PutUint32(ip[:], uint32(rng.Intn(1<
BenchmarkRangerLoad-16                 2     797361462 ns/op
BenchmarkInfobloxLoad-16              14      88521051 ns/op
BenchmarkNRadixLoad-16                27      40973037 ns/op
BenchmarkRangerRead-16           8325171           156.6 ns/op
BenchmarkInfobloxRead-16         2999112           450.3 ns/op
BenchmarkNRadixRead-16           3116860           362.2 ns/op

 

Edit: Update: In the end I ended up with a private fork that strips out a ton of code from cidranger to get the performance up. Mostly I removed support for all the different address types and used only netip types, as well as adding a loader which uses the last insert point as the starting point to search for a location for the new node. This got load times over 20x faster, and lookups 1.5x faster.
https://gist.github.com/phemmer/6231b12d5207ea93a1690ddc44a2c811

AmitThakkar commented 7 months ago

@phemmer What are the other alternatives you found?

phemmer commented 7 months ago

They're listed in the benchmark code on the issue description.

ldkingvivi commented 6 months ago

@phemmer Find func may not seem do what you want, it doesn't return most specific result. Try two entry in trie like 8.8.8.0/24 and 8.8.8.0/25 each with some value, and Find "8.8.8.8", it will show up /24 value instead of /25 value.

I think the original fn just Contain, which return bool and doesn't matter if that's more specific or not, and your Find fn try to return most specific entry(smallest prefix), which require follow the children path

maybe change the find fn to

func (p *Trie) find(number netip.Addr) any {
    if !netContains(p.network, number) {
        return nil
    }

    if p.network.Bits() == 128 {
        return nil
    }
    bit := p.discriminatorBitFromIP(number)
    child := p.children[bit]
    if child != nil {
           if r := child.find(number); r != nil{
        return r
           }
        }

        if p.value != nil {
        return p.value
    }

    return nil
}
gaby commented 6 months ago

@phemmer I'm guessing your gist is under MIT License, since it's based on cidranger. I can develop/publish a go module with this code, tests, benchmarks, docs and give you credit in the README. Let me know if that is fine with you.

gaby commented 6 months ago

@ldkingvivi Seems like you integrated those changes already and added even more functionality. The Contains() bool function is very useful!

phemmer commented 6 months ago

Yes, you may do what you want with it.

Also as a side note, I'd encourage discussion on the gist itself, as it's linked to from places other than here.

I also updated it to address the issue regarding find(). Solution was basically the same as what @ldkingvivi proposed.
I also included another fix when matching against v4 /32 or v6 /128 addresses.

phemmer commented 6 months ago

Since there seems to be moderate interest in my implementation, I've published it as a full package here: https://github.com/phemmer/go-iptrie/

gaby commented 6 months ago

@phemmer Awesome, from my local benchmarks it seems the changes made by @ldkingvivi make it even faster.

See here: https://github.com/ldkingvivi/cidranger/blob/master/iptire/trie.go

ldkingvivi commented 6 months ago

@phemmer my change was mainly adding Entry with generic, so containing and cover networking will all return something instead of pure prefix, and of course some tests and benchmark using existing way how cidrange tested/benchmakred. Do you want me create a PR to your repo? I will likely have time during weekend.

https://github.com/ldkingvivi/cidranger/commit/d5fcd23f8e3316a3b633cfa3b10911b47be08e01

phemmer commented 6 months ago

@phemmer my change was mainly adding Entry with generic

I had considered using generics, but stuck with any types as the usage is simpler (in some scenarios). I think using generics is a valid design that some might prefer, but I think it should likely be a new package entirely. It's possible one might be able to implement a design using any types on top of a generics implementation. But I didn't feel the effort was justified. If you do have a solid design that still allows both to be used, then it might be worth considering.

containing and cover networking will all return something instead of pure prefix

The Contains method stops searching on the first (largest) network it finds. Thus it's not going to be have access to the data from the most specific network (largest prefix) entry. If you do want the data from the largest network that matches, you can use the FindLargest method instead. I don't want to change Contains to be anything other than a boolean.

Having CoveredNetworks return the data, or something which allows obtaining both the data and the network might be valid. However I feel like a good design that allows retrieving the data along with the network itself would be rather complicated. Though I could be wrong, as I may not have considered all possibilities. I'm instead considering ripping out CoveredNetworks entirely. It's there because it was in cidranger. However I don't know of any valid use cases for it (if you have one, I'd love to hear it). So rather than expand its capability, I'm more inclined to remove it until someone actually needs it. I would rather provide minimal functionality, and expand it in the future, than have to maintain useless functionality in perpetuity.

and of course some tests and benchmark using existing way how cidrange tested/benchmakred

I omitted tests from my gist because gists are meant to be concise, not because they didn't exist. They exist, and were published in the package.

As far as benchmarks, the package also contains a benchmarking suite. If you don't feel the benchmarks are meaningful, I'd be willing to consider additional scenarios. I do know that the random networks aren't likely representative of real-world scenarios, as it creates too many networks which are too large. I've considered solving that using a curve to make smaller prefixes more rare. But I don't think it'd change the overall results significantly, so I haven't bothered.

Do you want me create a PR to your repo? I will likely have time during weekend.

If you like, though I'd recommend proposing specific changes for discussion first. The proposals should be in that repo as well, as this one should be for discussions about cidranger only, and I feel we've hijacked it enough.

ldkingvivi commented 6 months ago

I didn't referring to Contains, but ContainingNetworks. Also for the comments about tests and benchmark, I just simply mentioned the fact what I added, nothing criticize your gist. At this point, I will just continue use my branch directly, since I do have use case to fetch data other than just prefix, as well as generic for my use case.

One thing I agree, we probably hijacked this place enough, I will stop doing that now.