Closed maralorn closed 1 year ago
profiling does get into way of optimizations. It's results are unreliable. Try -fprof-late
with GHC 9.4
I have built my project with ghc 9.4 and with
cabal run --enable-profiling --profiling-detail=none --ghc-option -fprof-late --ghc-option -with-rtsopts=-pj
It appears the prof-late
does not apply to the dependencies, but only to my project. Is this what you wanted me to do?
castOptic
and beneath it repIso
, Lens.first
, Lens.second
are still present in the most expensive function of my program.
I don’t believe that you can tell anything from a speedscope screenshot, but maybe it illustrates, vaguely, what I am talking about:
I am not sure if the simplifier could profit from inlining castOptic
, but I think it’s plausible that it could help.
(maybe for context: In this benchmark this processJsonMessage
function is being called on over 1M json messages, so squeezing even a tiny bit out of that would probably help perfomance.)
This might be similar to #324.
Also, @maralorn where does the optics code that doesn't optimize come from? Is it hand written?
Also, optics
doesn't define repIso
. This function comes from generic-lens-core
. For the record, optics
provides built-in generic lenses and prisms, not sure why do you use generic-optics
.
@arybczak Thank you for helping me understanding this.
You are right, I was using generic-optics
here. I didn’t know that optics has generic lenses, thx for the hint. I removed generic-optics
from my project. Replacing field
with gfield
actually made the function 25% faster (in this benchmark), so that’s a huge win!
Still castOptic
remains in the eventlog. My code in questions is the following function:
(gfield @"activities" %~ IntMap.adjust (_3 ?~ progress) id'.value)
I apply it to a record which has a field called activities
of type IntMap _
.
I wonder if the problem is, that I haven’t yet brought cabal to build optics
with -fprof-late
. I think the -fprof-late
currently only applies to my project. Maybe castOptic
will disappear when I get it to do that? Although apparently all the operators from optics
which I am using seem to get inlined correctly. Only castOptic
and #
are remaining.
Replacing field with gfield actually made the function 25% faster (in this benchmark), so that’s a huge win!
That's great news :+1: FYI you can use gfield
via OverloadedLabels
, like this:
(#activities %~ IntMap.adjust (_3 ?~ progress) id'.value)
Regarding the topic, let's experiment. Can you try #470 and see whether it fixes the problem?
(I am not sure, why I refrained from using OverloadedLabels, I think it’s because sometimes the error messages get more confusing.)
Well, yes, applying that patch makes castOptic disappear from the eventlog. One operator remained: #.
so I additionally created https://github.com/well-typed/optics/commit/4d7be0d0f4e7956e7ea9fce649bab8999044b0e8 and now presumably everything get’s inlined, at least there are no visible cost centers in the function anymore, I haven‘t looked at the core though.
Now I don‘t know if the function got faster again, I have run the benchmark muitiple times and it looks like it might be a improvement of about 10% but it might not be statistically significant.
So yes, castOptic
get’s inlined with that change, I am just not sure how important that is.
I did an experiment with using review
in a code with profiling enabled and I didn't get castOptic
there, but I got #.
and .#
.
I don't see a harm in marking there functions INLINE because they and very small and thus always get inlined when profiling is off anyway, but the more fundamental issue here is that profiling in GHC interferes with optimizations, so other, similar problems will most likely pop in the future :disappointed:
A friendly reminder about -fprof-late
.
Well, if this is really just being caused by the profiler preventing inlining, then let‘s just close this issue.
I hear you @phadej, but then the real issue here is that there are currently no resources on the internet about the relation between -fprof-late and dependencies build with cabal. I am actually not even sure what -fprof-late is supposed to do when building a library …
the real issue here is that there are currently no resources on the internet about the relation between -fprof-late and dependencies build with cabal.
... which library maintainers cannot really address. Doing stuff with cabal
(or tbh, any build tool) which is just recently got added to GHC is often troublesome. Indeed, I have no good idea how to pass profiling specific flags to dependency package builds, and only when cabal
profiling build is done. (My first idea is to build a custom cabal
which does -fprof-late
instead of -fprof-auto
, but I wouldn't call that user friendly, nor really recommend to anyone).
Yeah, it’s obviously not an optics issue. I just insisted, to make sure, that I haven’t been making a stupid mistake.
Thx, anyway for taking the time to talk this through with me.
For the record, I experimented with -fprof-late
a bit and the trick to get all dependencies to compile with -fprof-late
I came up with is to put into your cabal.project.local
:
package *
ghc-options: -fno-prof-auto -fprof-late
@maralorn I'm running into very similar issues with https://discourse.haskell.org/t/debugging-why-late-profiling-detail-still-shows-optics-unexpectedly/6391. I'm trying to experiment with @arybczak's suggestion.
I've created a minimal, reproducible example showing the performance differences between lens and optics while profiling here: https://github.com/siraben/lens-optics-profiling
Very interesting. Thank you!
@siraben can you check with #491?
Ok, I checked it myself and it does indeed fix it.
I am still likely confused whether this fixes a problem that only happens during profiling and if so, why?
Looks like GHC doesn't inline small/trivial functions when profiling is enabled unless they are marked with INLINE pragmas. As for why this happens, it would be best to ask GHC devs.
Looks like GHC doesn't inline small/trivial functions when profiling is enabled unless they are marked with INLINE pragmas. As for why this happens, it would be best to ask GHC devs.
I don't think this is the case. Rather cabal ends up invoking ghc with both -fprof-auto and -fprof-late which ends up inserting the cost centres implied by both modes.
I've created a minimal, reproducible example showing the performance differences between lens and optics while profiling here: https://github.com/siraben/lens-optics-profiling
This was very helpful. At least in the reproducer this issue is that cabal will pass "-fprof-auto" while you then later pass -fprof-late manually. These two don't cancel each other out. Rather ghc will insert both the cost-centres for "-fprof-auto" and for "-fprof-late". This is because there are edge cases where one might want to combine these profiling modes. But perhaps that's not worth the confusion.
Anyway to properly use prof-late for just your project either use:
cabal run --enable-profiling {lens|optics}-profiling --ghc-options='-O2 -fprof-late' --profiling-detail=none -- +RTS -p -RTS
Or for newer versions of cabal
cabal run --enable-profiling {lens|optics}-profiling --ghc-options='-O2' --profiling-detail=late-- +RTS -p -RTS
As others already stated this doesn't apply prof-late to dependencies which is a shortcomming of cabal. To work around this you can use the cabal.project trick mentioned further up. So something like:
package *
ghc-options: -fno-prof-auto -fprof-late
profiling-detail: none
If there is a reproducer showing clearly that -fprof-late actually prevents inlining I would be very interested to take a look at it. But using the flags as I described it seems to work as expected for this reproducer.
@AndreasPK I've run this repo with the following cabal.project.local
:
with-compiler: ghc-9.4.5
profiling: True
profiling-detail: late
tests: True
and here's the difference:
lens-profiling:
Sun Jun 18 15:49 2023 Time and Allocation Profiling Report (Final)
lens-profiling +RTS -s -p -RTS
total time = 0.02 secs (19 ticks @ 1000 us, 1 processor)
total alloc = 40,051,984 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
$wloop Main <no location info> 100.0 99.9
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 162 0 0.0 0.0 100.0 100.0
main Main app/Main-lens.hs:29:1-4 324 1 0.0 0.0 0.0 0.0
main7 Main <no location info> 325 1 0.0 0.0 0.0 0.0
main1 Main <no location info> 326 1 0.0 0.0 0.0 0.0
CAF Main <entire-module> 323 0 0.0 0.0 100.0 99.9
$fShowBox13 Main <no location info> 333 1 0.0 0.0 0.0 0.0
main2 Main <no location info> 327 1 0.0 0.0 0.0 0.0
main3 Main <no location info> 328 1 0.0 0.0 0.0 0.0
$w$cshowsPrec2 Main app/Main-lens.hs:9:17-20 329 1 0.0 0.0 0.0 0.0
main4 Main <no location info> 330 1 0.0 0.0 100.0 99.9
$wloop Main <no location info> 331 1000001 100.0 99.9 100.0 99.9
main5 Main <no location info> 332 1 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal <entire-module> 284 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding <entire-module> 269 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv <entire-module> 267 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD <entire-module> 259 0 0.0 0.1 0.0 0.1
optics-profiling:
Sun Jun 18 15:49 2023 Time and Allocation Profiling Report (Final)
optics-profiling +RTS -s -p -RTS
total time = 0.02 secs (24 ticks @ 1000 us, 1 processor)
total alloc = 40,051,784 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
increment3 Main <no location info> 58.3 39.9
$wloop Main <no location info> 25.0 59.9
% Optics.Internal.Optic src/Optics/Internal/Optic.hs:(120,1)-(130,54) 16.7 0.0
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 139 0 0.0 0.0 100.0 100.0
main Main app/Main-optics.hs:36:1-4 278 1 0.0 0.0 0.0 0.0
main7 Main <no location info> 279 1 0.0 0.0 0.0 0.0
main1 Main <no location info> 280 1 0.0 0.0 0.0 0.0
CAF Main <entire-module> 277 0 0.0 0.0 100.0 99.9
$fShowBox13 Main <no location info> 295 1 0.0 0.0 0.0 0.0
increment2 Main <no location info> 286 1 0.0 0.0 0.0 0.0
castOptic Optics.Internal.Optic src/Optics/Internal/Optic.hs:(102,1)-(108,44) 288 1 0.0 0.0 0.0 0.0
% Optics.Internal.Optic src/Optics/Internal/Optic.hs:(120,1)-(130,54) 289 2 0.0 0.0 0.0 0.0
runFunArrow Data.Profunctor.Indexed src/Data/Profunctor/Indexed.hs:78:37-47 287 1 0.0 0.0 0.0 0.0
main2 Main <no location info> 281 1 0.0 0.0 0.0 0.0
main3 Main <no location info> 282 1 0.0 0.0 0.0 0.0
$w$cshowsPrec2 Main app/Main-optics.hs:16:26-29 283 1 0.0 0.0 0.0 0.0
main4 Main <no location info> 284 1 0.0 0.0 100.0 99.9
$wloop Main <no location info> 285 1000001 25.0 59.9 100.0 99.9
increment2 Main <no location info> 290 0 0.0 0.0 75.0 39.9
castOptic Optics.Internal.Optic src/Optics/Internal/Optic.hs:(102,1)-(108,44) 291 0 0.0 0.0 75.0 39.9
% Optics.Internal.Optic src/Optics/Internal/Optic.hs:(120,1)-(130,54) 292 0 16.7 0.0 75.0 39.9
increment3 Main <no location info> 293 1000000 58.3 39.9 58.3 39.9
main5 Main <no location info> 294 1 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal <entire-module> 265 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding <entire-module> 250 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv <entire-module> 248 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD <entire-module> 240 0 0.0 0.1 0.0 0.1
Only after #491 it becomes:
Sun Jun 18 15:53 2023 Time and Allocation Profiling Report (Final)
optics-profiling +RTS -s -p -RTS
total time = 0.02 secs (19 ticks @ 1000 us, 1 processor)
total alloc = 40,051,768 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
$wloop Main <no location info> 100.0 99.9
individual inherited
COST CENTRE MODULE SRC no. entries %time %alloc %time %alloc
MAIN MAIN <built-in> 127 0 0.0 0.0 100.0 100.0
main Main app/Main-optics.hs:36:1-4 254 1 0.0 0.0 0.0 0.0
main7 Main <no location info> 255 1 0.0 0.0 0.0 0.0
main1 Main <no location info> 256 1 0.0 0.0 0.0 0.0
CAF Main <entire-module> 253 0 0.0 0.0 100.0 99.9
$fShowBox13 Main <no location info> 263 1 0.0 0.0 0.0 0.0
main2 Main <no location info> 257 1 0.0 0.0 0.0 0.0
main3 Main <no location info> 258 1 0.0 0.0 0.0 0.0
$w$cshowsPrec2 Main app/Main-optics.hs:16:26-29 259 1 0.0 0.0 0.0 0.0
main4 Main <no location info> 260 1 0.0 0.0 100.0 99.9
$wloop Main <no location info> 261 1000001 100.0 99.9 100.0 99.9
main5 Main <no location info> 262 1 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal <entire-module> 248 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding <entire-module> 238 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv <entire-module> 236 0 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD <entire-module> 228 0 0.0 0.1 0.0 0.1
Ok, I didn't see this bit:
As others already stated this doesn't apply prof-late to dependencies which is a shortcomming of cabal. To work around this you can use the cabal.project trick mentioned further up.
Looks like a bug in cabal to me if profiling
affects dependencies, but profiling-detail
doesn't.
@arybczak the new changes help, thank you! Would you be able to make a release to Hackage that incorporates these changes?
I will after #492 is merged.
I am fairly new to profiling/performance analysis, so this might be a wild goose chase.
Still, when profiling my application I notice, that in my hot code paths there often appears
castOptic
. The haddocks for castOptic say, that it is basically a no-op, which makes sense. And the eventlog does not claim, that there is a lot of time being spent in castOptic, yet I fear that it is preventing further optimisation from happening, because ghc can‘t simplify it away.Is there a way around this? Will ghc simply inline
castOptic
when I am not in profiling mode? I fear inlining it might not work without specialisingimplies
anyway?If I can help by doing any specific debugging, I’d be glad to help.