Closed karussell closed 1 year ago

We tried to integrate the library into GraphHopper but we are currently facing a rather big slowdown. E.g. for this PBF we normally have an import time of around 23s, and with the latest integration it takes ~60-70s. For the planet PBF the slowdown exists too, but the bigger problem in those cases are memory errors like

```
GC overhead limit exceeded
```

that don't go away even with 20GB more RAM. As the GC seems to be very active, it could also be the reason for the slowdown, but I'm unsure, as it wasn't faster even when I gave it several more GBs. I also tried to reduce the regex compilations with a cache, but without success.

The only thing that helped was to avoid calling `limitSpeedsTo`. For that I replaced `createResultTags` with:

Is it possible to make calling this optional? The reduction of the other speeds is not really needed in our case, as we would reduce it ourselves via `Math.min`, which would be simpler and faster. (Or maybe do this only when fetched from the map, but a custom map implementation would be necessary.)

I tried to avoid the pattern matching and used indexOfs instead; the improvement was significant but not enough for us (only down to 33s).
Do you have a benchmark result, i.e. how much time was spent in which functions?
I see a few places that maybe could be optimized, namely the new instance creation of the Regexes (or the use of the regexes altogether) and the copying of lists and maps. But it would be preliminary optimization without seeing if it actually matters.
`limitSpeedsTo` is not optional; otherwise, for an input of e.g.

```
maxspeed=60
highway=motorway
```

the library would output an HGV speed like this:

```
maxspeed:hgv=100
```

(because the legislation says that an HGV may not go faster than 100, but that also means it may not go faster than any posted general speed limit; the mentioned function guarantees that).
The requirement of this library is certainly to be correct, so I don't think we can refrain from doing that.
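To illustrate the guarantee described above, here is a hedged sketch of the clamping semantics (not the library's actual implementation; conditional values and unit handling are ignored):

```kotlin
// Sketch only: clamp every maxspeed:* subkey to the general maxspeed,
// mirroring what limitSpeedsTo is described to guarantee above.
fun clampToGeneralLimit(tags: MutableMap<String, String>) {
    val general = tags["maxspeed"]?.toIntOrNull() ?: return
    for (entry in tags.entries) {
        if (!entry.key.startsWith("maxspeed:")) continue
        val speed = entry.value.toIntOrNull() ?: continue
        // the subkey speed must never exceed the general posted limit
        entry.setValue(minOf(speed, general).toString())
    }
}
```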
> I tried to avoid the pattern matching and used indexOfs instead and the improvement was significant but not enough for us (only down to 33sec).
That's nice, though. Please feel free to create a PR for that already. (One thing after another.)
The PR #8 improved it only slightly (down to 50sec). Not sure what I measured before.
Maybe you can investigate how to integrate https://github.com/Kotlin/kotlinx-benchmark? It seems it requires Kotlin 1.7.20, though. Unfortunately I'm failing to do this due to my lack of Gradle knowledge. With such a benchmark we could also check #8.
> limitSpeedsTo is not optional, otherwise the library would output for input of e.g. ...
Yes, I understand this. But for us this is not important. Currently we fetch only the maxspeed (for urban + rural). But even if we fetch `maxspeed:hgv` at some point, it will be much faster to do `Math.min(maxspeedHGV, maxspeed)` rather than calling `limitSpeedsTo`. And currently we don't handle conditional values either, so this is also something that we'll approach in a later step.
> The requirement of this library is certainly to be correct, so I don't think we can refrain from doing that.
What if you could provide a class that does the `limitSpeedsTo` only upon retrieval? Then the expensive `limitSpeedsTo` is avoided for our use case, as we would only fetch maxspeed, and the library still provides a correct result.
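A hedged sketch of what such a lazily-limiting map could look like (the names are hypothetical, and only `get()` applies the limit, so iterating entries or values would still expose raw values in this sketch):

```kotlin
// Hypothetical wrapper: applies a clamping function only when a tag is read.
class LazyLimitedTags(
    private val raw: Map<String, String>,
    private val limit: (key: String, value: String) -> String
) : Map<String, String> by raw {
    override fun get(key: String): String? = raw[key]?.let { limit(key, it) }
}
```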
Here is a small preliminary overview of our import times when removing certain code parts from this library:

- #8 => 50s
- #8, but remove createResultTags and directly return roadTypeTags => 25s (instead of 23s)
- #8, but remove limitSpeedsTo in createResultTags => 27s
- #8, but no subkeys in limitSpeedsTo => 38s
- #8, but no maxspeed != null in limitSpeedsTo => 39s

Of course several tests do not pass; this is just to indicate the most expensive code parts.
The problem is also my lack of kotlin knowledge and so even a simple stopwatch measurement for the different code parts turns out to be tricky for me.
What would probably speed up the results for large input data somewhat (independent of any performance optimizations within this library) would be to cache the results, i.e. have a map like

```kotlin
// Map<input tags, result>
Map<Map<String, String>, LegalDefaultSpeeds.Result>
```

and only calculate the result if it is not already present in the map. The memory footprint of such a cache could be reduced somewhat if only the tags relevant to default speed limits were considered (or if the input ways' tags were already cleaned of irrelevant tags).
Is it maybe the conditionals-stuff that takes this long?
Actually, regarding the cache: one can just use a (large) LRU cache to guarantee that it does not exceed a certain memory limit while the most common taggings stay memorized.
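For illustration, such an LRU cache could be built on LinkedHashMap's access order. This is just a sketch; the capacity is arbitrary, and the usage line assumes the `LegalDefaultSpeeds.Result` type mentioned above:

```kotlin
// Evicts the least recently used entry once maxEntries is exceeded
// (the third constructor argument, accessOrder = true, makes
// LinkedHashMap track access order instead of insertion order).
class LruCache<K, V>(private val maxEntries: Int) {
    private val map = object : LinkedHashMap<K, V>(16, 0.75f, true) {
        override fun removeEldestEntry(eldest: MutableMap.MutableEntry<K, V>): Boolean =
            size > maxEntries
    }

    fun getOrCompute(key: K, compute: () -> V): V = map.getOrPut(key) { compute() }
}

// Usage sketch, keyed by the input tags:
// val cache = LruCache<Map<String, String>, LegalDefaultSpeeds.Result?>(100_000)
// val result = cache.getOrCompute(tags) { computeResult(tags) } // computeResult is hypothetical
```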
I've just replaced the creation of a list (and the removal of elements from it) with doing the same via a sequence. Let me know if it makes any difference.
Regarding the conditionals stuff, the current situation is a bit awkward: the Python parser of the wiki page outputs speed limits in the JSON like this:

```json
"tags": {
  "maxspeed": "100",
  "maxspeed:goods": "80",
  "maxspeed:goods:conditional": "70 @ (weightrating>3.5); 60 @ (weightrating>12)"
},
```

only for the Kotlin library to have to pull these values apart and glue them back together again in the `limitSpeedsTo` function.
It may be easier to consume if the output of the JSON generator were something like this (or similar):

```json
"tags": {
  "maxspeed": {
    "speed": "100"
  },
  "maxspeed:goods": {
    "speed": "80",
    "conditionals": [
      {"speed": "70", "condition": "weightrating>3.5"},
      {"speed": "60", "condition": "weightrating>12"}
    ]
  }
},
```
However, the only thing that would be saved would be that pulling-apart and gluing-back-together of the strings. So I am not sure if this really makes such a difference, given that to change it, one has to change both the Python script and the Kotlin library parsing its output. But maybe you can check with your benchmark setup how much time is spent on that.
Thanks a lot! Your last improvements got the import down to ~40-45sec! (ie. the overhead went down from 27s to 17s)
So, how much of this overhead is the splitting of strings with `;` and `@`?
It seems to be negligible. Maybe the map entry removal is the problem. Unfortunately, I ran out of time. I will also try to get the JMH benchmarking working here so that you don't have to rely on my word. Update: the current state does not work :(
But would you accept a PR where the limiting of the other speeds could be disabled?
No, not really. If at all, then via another method, something like getRoadType. It doesn't make a lot of sense to me not to filter out the invalid results. Even if you say, well, we don't need speed limits beyond the generic one YET, you will at some point, at which point we are back where we are now. Also, I don't understand why an import time of 1 minute (as opposed to 30s) would be problematic?
Anyway, if you need the speed, an LRU cache as I mentioned should solve all the issues you have and practically reduce the overhead to 0 for large data imports.
Regarding the string splitting, I think the easiest way to measure the impact is to add another string split of the same kind to the code (but do nothing more with the result) and see how much it differs from the current version. I don't think it can be negligible. Even without the creation of new lists (as per my change), it should cost at least iterating over all the characters of all conditionals and creating new string objects from them.
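For a rough upper bound, one could also time the split in isolation. A quick sketch, not a proper JMH benchmark:

```kotlin
import kotlin.system.measureNanoTime

fun main() {
    val conditional = "70 @ (weightrating>3.5); 60 @ (weightrating>12)"
    var sink = 0 // keep the result live so the work is not optimized away
    val nanos = measureNanoTime {
        repeat(1_000_000) {
            for (part in conditional.split(";")) sink += part.trim().length
        }
    }
    println("~${nanos / 1_000_000} ns per split call, sink=$sink")
}
```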
> something like getRoadType
Can you describe what you have in mind?
> Also, I don't understand why an import time of 1 minute (as opposed to 30s) would be problematic?
That was a tiny test PBF. We deal with planet PBFs, and currently, due to this library, we would require 50% more RAM, which is a no-go if you already need 80GB of RAM and everything else is highly optimized. Furthermore, the slowdown is prohibitive in these cases too: instead of 2h, the import would take 4h or more (we have not finished a run yet due to the RAM constraint).
> we don't need speed limits beyond the generic one YET, you will at one point, at which point we are back at the point we are now.
No. As I wrote: if I pick e.g. `maxspeed:hgv`, I will just reduce it using the maxspeed via a cheap `Math.min`.
> Regarding the string splitting, I think the easiest way to measure the impact [...]
Even without this there are other places that introduce the problem.
I will also have to investigate our usage of the library once again and see if we can reduce the load somehow, because it seems that even without `limitSpeedsTo` there is some extreme overhead.
Ah, I see. I'd imagine that creating another hashmap for every single road, instead of modifying the input map, could be a source of overhead. Also, the matching of each road to a road type is of course a certain "overhead", in other words, the core functionality of the library. The matching can also include regex matching etc., so it comes at a certain cost that cannot be avoided.
Something like

```kotlin
fun getRoadType(
    countryCode: String,
    tags: Map<String, String>,
    relationsTags: List<Map<String, String>> = emptyList(),
    replacerFn: (name: String, evaluate: () -> Boolean) -> Boolean = { _, ev -> ev() }
): AnotherResult?
```

where AnotherResult would contain the certitude, the matched road type name, and the "raw" tags associated with that road type, is what I imagine would be fine to include.
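A hypothetical shape for AnotherResult, with field names that are assumptions rather than an actual API:

```kotlin
// Hypothetical result type for getRoadType; Certitude refers to the enum the
// library already exposes on its Result type (an assumption about naming).
data class AnotherResult(
    val certitude: Certitude,          // how confidently the road type was matched
    val roadTypeName: String?,         // name of the matched road type
    val tags: Map<String, String>      // "raw" tags associated with that road type
)
```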
But uhm, as I wrote, an LRU cache that caches the results keyed by their input values (the output for the same input values will be the same, after all) will solve any performance issues.
> But uhm, as I wrote, an LRU cache that caches the results keyed by their input values (the output for the same input values will be the same, after all) will solve any performance issues.
Will try
You could use async-profiler to track CPU usage and memory allocations.
I'm now limiting the tags to only the necessary ones, which speeds up the import. Here are the times for a larger PBF (Germany):

| Xmx | Input | pass2 time | Workaround #9 |
|---|---|---|---|
| 5000m | empty speed parser (lib not called) | 345s ±1 | no |
| 5000m | all tags | ~798s | no |
| 5000m | all tags | ~414s | yes |
| 5000m | all tags + cache | ~960s | no |
| 5000m | filtered tags | 742s ±9 | no |
| 5000m | filtered tags | 392s ±1 | yes |
| 5000m | filtered tags + cache | 369s ±3 | no |
| 5000m | filtered tags + cache | 362s ±4 | yes |
| 7000m | filtered tags + cache | 336s | no |
Not sure why this helps. It could also be the maxspeed.*conditional tags, which I'm currently rejecting.
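The filtering could look roughly like this (the key set is illustrative only, not the exact set we use):

```kotlin
// Keep only tags that can influence the default-speed lookup; drop
// maxspeed:*:conditional values, which are rejected as described above.
val relevantKeys = setOf("highway", "maxspeed", "motorroad", "dual_carriageway")

fun filterTags(tags: Map<String, String>): Map<String, String> =
    tags.filterKeys { key ->
        key in relevantKeys ||
            (key.startsWith("maxspeed:") && !key.endsWith(":conditional"))
    }
```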
Unfortunately, this does not fix the memory problems for the planet PBF, and I currently need to increase the Xmx setting to 100g to make it run without a GC or heap error, whereas our master branch is fine with 80g! Will have to investigate further...
That a cache has very little effect or even a negative effect really puzzles me... how can effectively not calling the library (i.e. only calling it for unique inputs) be about as fast as calling the library?

Though, that the key of your cache is a string is a little suspect. It might be a bit costly to create a string from each and every tag of each and every road segment. I'd have created a data structure with a proper hashCode that contains the tags map and the country alpha-2 code.
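Such a key could be as simple as the following sketch; a Kotlin data class derives equals() and hashCode() from its fields, so no string concatenation is needed:

```kotlin
// Structural equality and hashCode come for free with a data class.
data class CacheKey(val countryAlpha2: String, val tags: Map<String, String>)
```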
> how can effectively not calling the library (i.e. only calling it for unique inputs) be about as fast as calling the library?
A cache always has the cost of calculating the cache key, which is expensive in this case: the more tags there are and the longer they are, the more expensive the hashCode calculation is.
> I'd have created a data structure with a proper hashCode that contains the tags map and the country alpha-2 code.
But any map works exactly like this(?): it uses the hashCode (here, the one from the string) as a "bucket index" and only uses the expensive object comparison if the hashCodes are the same. I also tried to store the tags map itself as the cache key, but this made no big difference in the measured overhead: we avoid the string concatenation for the key but still have to calculate the hashCode from all those strings.
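For reference, hashing the tags map is effectively this much work (mirroring what java.util.AbstractMap.hashCode does):

```kotlin
// Equivalent work to Map.hashCode(): the sum of the entry hash codes,
// i.e. every key and value string gets hashed. String caches its hash code,
// but that only helps on repeated hashing of the *same* string instances.
fun tagsHash(tags: Map<String, String>): Int =
    tags.entries.sumOf { it.key.hashCode() xor it.value.hashCode() }
```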
Yes, but the String concatenation is avoidable, i.e. it is extra cost.
JMC also provides a solid automated analysis which might help here :)
https://www.oracle.com/java/technologies/javase/products-jmc8-downloads.html
What I find suspicious is that in some simple perf tests, the "rural" calculation always takes approx. 3-5 times longer than the "urban" calculation, seemingly regardless of the tags:

```
0 urban sum:4.028s, time/call:4.028µs, dummy: 1333333
0 rural sum:13.02s, time/call:0.013ms, dummy: 10666664
1 urban sum:3.489s, time/call:3.489µs, dummy: 1333333
1 rural sum:12.315s, time/call:0.012ms, dummy: 10666664
2 urban sum:3.478s, time/call:3.478µs, dummy: 1333333
2 rural sum:17.142s, time/call:0.017ms, dummy: 10666664
3 urban sum:2.93s, time/call:2.93µs, dummy: 1333333
3 rural sum:15.677s, time/call:0.016ms, dummy: 10666664
```
Am I utilizing the library correctly?
Update: those are the many "conditional" clauses. As we don't need them I'll exclude them. Now rural and urban are nearly the same speed.
> Update: those are the many "conditional" clauses. As we don't need them I'll exclude them. Now rural and urban are nearly the same speed.
So, the string splitting does make a difference after all?
> Update: those are the many "conditional" clauses. As we don't need them I'll exclude them. Now rural and urban are nearly the same speed.
Interestingly, this overhead makes no difference for our import, probably because of the cache.
> So, the string splitting does make a difference after all?
You mean master vs. 1.2? Have not tested it and always used master since then.
btw: would you be able to release a new version?
> Yes, but the String concatenation is avoidable, i.e. it is extra cost.
I have done this, although the benefits are (strangely) not really measurable.
> Unfortunately this does not fix the memory problems for the planet PBF and I currently need to increase the Xmx setting to 100g to make it run without a GC or heap error but our master branch is fine with 80g!
This was due to an embarrassing mistake I made where I compared apples with oranges ... no comment ;) ... so the planet runs through with 83g, which is a bit more than expected but completely acceptable.
I think we can close here.
And thank you for your support, feedback and ideas!
I see you already did! Some notes:
> btw: would you be able to release a new version?
Could do, need to check if I have the keys with me (I'm traveling right now)
> So, the string splitting does make a difference after all?
>
> You mean master vs. 1.2? Have not tested it and always used master since then.
No, I mean the splitting of the strings done here, which would be avoidable if the JSON was changed as outlined in https://github.com/westnordost/osm-legal-default-speeds/issues/7#issuecomment-1558760541
> Could do, need to check if I have the keys with me (I'm traveling right now)
Nice, thanks. And no hurry.
> So, the string splitting does make a difference after all?
Ah, ok. It will have an impact, but due to the caching that we use, the impact is probably negligible. E.g. with #9 I was able to improve things, but by <5%, so in the end it wasn't worth using; I guess the splitting would be visible in a measurement, but would be even smaller.