well-typed / optics

Optics as an abstract interface
376 stars 24 forks source link

Missing INLINE pragma on castOptic? #469

Closed maralorn closed 1 year ago

maralorn commented 2 years ago

I am fairly new to profiling/performance analysis, so this might be a wild goose chase.

Still, when profiling my application I notice, that in my hot code paths there often appears castOptic. The haddocks for castOptic say, that it is basically a no-op, which makes sense. And the eventlog does not claim, that there is a lot of time being spent in castOptic, yet I fear that it is preventing further optimisation from happening, because ghc can‘t simplify it away.

Is there a way around this? Will ghc simply inline castOptic when I am not in profiling mode? I fear inlining it might not work without specialising implies anyway?

If I can help by doing any specific debugging, I’d be glad to help.

phadej commented 2 years ago

profiling does get into way of optimizations. It's results are unreliable. Try -fprof-late with GHC 9.4

maralorn commented 2 years ago

I have built my project with ghc 9.4 and with

cabal run --enable-profiling --profiling-detail=none --ghc-option -fprof-late --ghc-option -with-rtsopts=-pj

It appears the prof-late does not apply to the dependencies, but only to my project. Is this what you wanted me to do?

castOptic and beneath it repIso, Lens.first, Lens.second are still present in the most expensive function of my program.

I don’t believe that you can tell anything from a speedscope screenshot, but maybe it illustrates, vaguely, what I am talking about:

image

I am not sure if the simplifier could profit from inlining castOptic, but I think it’s plausible that it could help.

(maybe for context: In this benchmark this processJsonMessage function is being called on over 1M json messages, so squeezing even a tiny bit out of that would probably help perfomance.)

arybczak commented 2 years ago

This might be similar to #324.

arybczak commented 2 years ago

Also, @maralorn where does the optics code that doesn't optimize come from? Is it hand written?

arybczak commented 2 years ago

Also, optics doesn't define repIso. This function comes from generic-lens-core. For the record, optics provides built-in generic lenses and prisms, not sure why do you use generic-optics.

maralorn commented 2 years ago

@arybczak Thank you for helping me understanding this.

You are right, I was using generic-optics here. I didn’t know that optics has generic lenses, thx for the hint. I removed generic-optics from my project. Replacing field with gfield actually made the function 25% faster (in this benchmark), so that’s a huge win!

Still castOptic remains in the eventlog. My code in questions is the following function:

(gfield @"activities" %~ IntMap.adjust (_3 ?~ progress) id'.value)

I apply it to a record which has a field called activities of type IntMap _.

I wonder if the problem is, that I haven’t yet brought cabal to build optics with -fprof-late. I think the -fprof-late currently only applies to my project. Maybe castOptic will disappear when I get it to do that? Although apparently all the operators from optics which I am using seem to get inlined correctly. Only castOptic and # are remaining.

arybczak commented 2 years ago

Replacing field with gfield actually made the function 25% faster (in this benchmark), so that’s a huge win!

That's great news :+1: FYI you can use gfield via OverloadedLabels, like this:

(#activities %~ IntMap.adjust (_3 ?~ progress) id'.value)

Regarding the topic, let's experiment. Can you try #470 and see whether it fixes the problem?

maralorn commented 2 years ago

(I am not sure, why I refrained from using OverloadedLabels, I think it’s because sometimes the error messages get more confusing.)

Well, yes, applying that patch makes castOptic disappear from the eventlog. One operator remained: #. so I additionally created https://github.com/well-typed/optics/commit/4d7be0d0f4e7956e7ea9fce649bab8999044b0e8 and now presumably everything get’s inlined, at least there are no visible cost centers in the function anymore, I haven‘t looked at the core though.

Now I don‘t know if the function got faster again, I have run the benchmark muitiple times and it looks like it might be a improvement of about 10% but it might not be statistically significant.

So yes, castOptic get’s inlined with that change, I am just not sure how important that is.

arybczak commented 2 years ago

I did an experiment with using review in a code with profiling enabled and I didn't get castOptic there, but I got #. and .#.

I don't see a harm in marking there functions INLINE because they and very small and thus always get inlined when profiling is off anyway, but the more fundamental issue here is that profiling in GHC interferes with optimizations, so other, similar problems will most likely pop in the future :disappointed:

phadej commented 2 years ago

A friendly reminder about -fprof-late.

maralorn commented 2 years ago

Well, if this is really just being caused by the profiler preventing inlining, then let‘s just close this issue.

I hear you @phadej, but then the real issue here is that there are currently no resources on the internet about the relation between -fprof-late and dependencies build with cabal. I am actually not even sure what -fprof-late is supposed to do when building a library …

phadej commented 2 years ago

the real issue here is that there are currently no resources on the internet about the relation between -fprof-late and dependencies build with cabal.

... which library maintainers cannot really address. Doing stuff with cabal (or tbh, any build tool) which is just recently got added to GHC is often troublesome. Indeed, I have no good idea how to pass profiling specific flags to dependency package builds, and only when cabal profiling build is done. (My first idea is to build a custom cabal which does -fprof-late instead of -fprof-auto, but I wouldn't call that user friendly, nor really recommend to anyone).

maralorn commented 2 years ago

Yeah, it’s obviously not an optics issue. I just insisted, to make sure, that I haven’t been making a stupid mistake.

maralorn commented 2 years ago

Thx, anyway for taking the time to talk this through with me.

arybczak commented 2 years ago

For the record, I experimented with -fprof-late a bit and the trick to get all dependencies to compile with -fprof-late I came up with is to put into your cabal.project.local:

package *
  ghc-options: -fno-prof-auto -fprof-late
siraben commented 1 year ago

@maralorn I'm running into very similar issues with https://discourse.haskell.org/t/debugging-why-late-profiling-detail-still-shows-optics-unexpectedly/6391. I'm trying to experiment with @arybczak's suggestion.

siraben commented 1 year ago

I've created a minimal, reproducible example showing the performance differences between lens and optics while profiling here: https://github.com/siraben/lens-optics-profiling

maralorn commented 1 year ago

Very interesting. Thank you!

arybczak commented 1 year ago

@siraben can you check with #491?

arybczak commented 1 year ago

Ok, I checked it myself and it does indeed fix it.

maralorn commented 1 year ago

I am still likely confused whether this fixes a problem that only happens during profiling and if so, why?

arybczak commented 1 year ago

Looks like GHC doesn't inline small/trivial functions when profiling is enabled unless they are marked with INLINE pragmas. As for why this happens, it would be best to ask GHC devs.

AndreasPK commented 1 year ago

Looks like GHC doesn't inline small/trivial functions when profiling is enabled unless they are marked with INLINE pragmas. As for why this happens, it would be best to ask GHC devs.

I don't think this is the case. Rather cabal ends up invoking ghc with both -fprof-auto and -fprof-late which ends up inserting the cost centres implied by both modes.

I've created a minimal, reproducible example showing the performance differences between lens and optics while profiling here: https://github.com/siraben/lens-optics-profiling

This was very helpful. At least in the reproducer this issue is that cabal will pass "-fprof-auto" while you then later pass -fprof-late manually. These two don't cancel each other out. Rather ghc will insert both the cost-centres for "-fprof-auto" and for "-fprof-late". This is because there are edge cases where one might want to combine these profiling modes. But perhaps that's not worth the confusion.

Anyway to properly use prof-late for just your project either use:

cabal run --enable-profiling {lens|optics}-profiling --ghc-options='-O2 -fprof-late' --profiling-detail=none -- +RTS -p -RTS

Or for newer versions of cabal

cabal run --enable-profiling {lens|optics}-profiling --ghc-options='-O2' --profiling-detail=late-- +RTS -p -RTS

As others already stated this doesn't apply prof-late to dependencies which is a shortcomming of cabal. To work around this you can use the cabal.project trick mentioned further up. So something like:

package *
  ghc-options: -fno-prof-auto -fprof-late
  profiling-detail: none

If there is a reproducer showing clearly that -fprof-late actually prevents inlining I would be very interested to take a look at it. But using the flags as I described it seems to work as expected for this reproducer.

arybczak commented 1 year ago

@AndreasPK I've run this repo with the following cabal.project.local:

with-compiler: ghc-9.4.5

profiling: True
profiling-detail: late

tests: True

and here's the difference:

lens-profiling:

        Sun Jun 18 15:49 2023 Time and Allocation Profiling Report  (Final)

           lens-profiling +RTS -s -p -RTS

        total time  =        0.02 secs   (19 ticks @ 1000 us, 1 processor)
        total alloc =  40,051,984 bytes  (excludes profiling overheads)

COST CENTRE MODULE SRC                 %time %alloc

$wloop      Main   <no location info>  100.0   99.9

                                                                                    individual      inherited
COST CENTRE       MODULE                SRC                      no.     entries  %time %alloc   %time %alloc

MAIN              MAIN                  <built-in>               162           0    0.0    0.0   100.0  100.0
 main             Main                  app/Main-lens.hs:29:1-4  324           1    0.0    0.0     0.0    0.0
  main7           Main                  <no location info>       325           1    0.0    0.0     0.0    0.0
   main1          Main                  <no location info>       326           1    0.0    0.0     0.0    0.0
 CAF              Main                  <entire-module>          323           0    0.0    0.0   100.0   99.9
  $fShowBox13     Main                  <no location info>       333           1    0.0    0.0     0.0    0.0
  main2           Main                  <no location info>       327           1    0.0    0.0     0.0    0.0
  main3           Main                  <no location info>       328           1    0.0    0.0     0.0    0.0
   $w$cshowsPrec2 Main                  app/Main-lens.hs:9:17-20 329           1    0.0    0.0     0.0    0.0
  main4           Main                  <no location info>       330           1    0.0    0.0   100.0   99.9
   $wloop         Main                  <no location info>       331     1000001  100.0   99.9   100.0   99.9
  main5           Main                  <no location info>       332           1    0.0    0.0     0.0    0.0
 CAF              GHC.Conc.Signal       <entire-module>          284           0    0.0    0.0     0.0    0.0
 CAF              GHC.IO.Encoding       <entire-module>          269           0    0.0    0.0     0.0    0.0
 CAF              GHC.IO.Encoding.Iconv <entire-module>          267           0    0.0    0.0     0.0    0.0
 CAF              GHC.IO.Handle.FD      <entire-module>          259           0    0.0    0.1     0.0    0.1

optics-profiling:

        Sun Jun 18 15:49 2023 Time and Allocation Profiling Report  (Final)

           optics-profiling +RTS -s -p -RTS

        total time  =        0.02 secs   (24 ticks @ 1000 us, 1 processor)
        total alloc =  40,051,784 bytes  (excludes profiling overheads)

COST CENTRE MODULE                SRC                                            %time %alloc

increment3  Main                  <no location info>                              58.3   39.9
$wloop      Main                  <no location info>                              25.0   59.9
%           Optics.Internal.Optic src/Optics/Internal/Optic.hs:(120,1)-(130,54)   16.7    0.0

                                                                                                           individual      inherited
COST CENTRE       MODULE                  SRC                                           no.     entries  %time %alloc   %time %alloc

MAIN              MAIN                    <built-in>                                    139           0    0.0    0.0   100.0  100.0
 main             Main                    app/Main-optics.hs:36:1-4                     278           1    0.0    0.0     0.0    0.0
  main7           Main                    <no location info>                            279           1    0.0    0.0     0.0    0.0
   main1          Main                    <no location info>                            280           1    0.0    0.0     0.0    0.0
 CAF              Main                    <entire-module>                               277           0    0.0    0.0   100.0   99.9
  $fShowBox13     Main                    <no location info>                            295           1    0.0    0.0     0.0    0.0
  increment2      Main                    <no location info>                            286           1    0.0    0.0     0.0    0.0
   castOptic      Optics.Internal.Optic   src/Optics/Internal/Optic.hs:(102,1)-(108,44) 288           1    0.0    0.0     0.0    0.0
    %             Optics.Internal.Optic   src/Optics/Internal/Optic.hs:(120,1)-(130,54) 289           2    0.0    0.0     0.0    0.0
   runFunArrow    Data.Profunctor.Indexed src/Data/Profunctor/Indexed.hs:78:37-47       287           1    0.0    0.0     0.0    0.0
  main2           Main                    <no location info>                            281           1    0.0    0.0     0.0    0.0
  main3           Main                    <no location info>                            282           1    0.0    0.0     0.0    0.0
   $w$cshowsPrec2 Main                    app/Main-optics.hs:16:26-29                   283           1    0.0    0.0     0.0    0.0
  main4           Main                    <no location info>                            284           1    0.0    0.0   100.0   99.9
   $wloop         Main                    <no location info>                            285     1000001   25.0   59.9   100.0   99.9
    increment2    Main                    <no location info>                            290           0    0.0    0.0    75.0   39.9
     castOptic    Optics.Internal.Optic   src/Optics/Internal/Optic.hs:(102,1)-(108,44) 291           0    0.0    0.0    75.0   39.9
      %           Optics.Internal.Optic   src/Optics/Internal/Optic.hs:(120,1)-(130,54) 292           0   16.7    0.0    75.0   39.9
       increment3 Main                    <no location info>                            293     1000000   58.3   39.9    58.3   39.9
  main5           Main                    <no location info>                            294           1    0.0    0.0     0.0    0.0
 CAF              GHC.Conc.Signal         <entire-module>                               265           0    0.0    0.0     0.0    0.0
 CAF              GHC.IO.Encoding         <entire-module>                               250           0    0.0    0.0     0.0    0.0
 CAF              GHC.IO.Encoding.Iconv   <entire-module>                               248           0    0.0    0.0     0.0    0.0
 CAF              GHC.IO.Handle.FD        <entire-module>                               240           0    0.0    0.1     0.0    0.1

Only after #491 it becomes:

        Sun Jun 18 15:53 2023 Time and Allocation Profiling Report  (Final)

           optics-profiling +RTS -s -p -RTS

        total time  =        0.02 secs   (19 ticks @ 1000 us, 1 processor)
        total alloc =  40,051,768 bytes  (excludes profiling overheads)

COST CENTRE MODULE SRC                 %time %alloc

$wloop      Main   <no location info>  100.0   99.9

                                                                                       individual      inherited
COST CENTRE       MODULE                SRC                         no.     entries  %time %alloc   %time %alloc

MAIN              MAIN                  <built-in>                  127           0    0.0    0.0   100.0  100.0
 main             Main                  app/Main-optics.hs:36:1-4   254           1    0.0    0.0     0.0    0.0
  main7           Main                  <no location info>          255           1    0.0    0.0     0.0    0.0
   main1          Main                  <no location info>          256           1    0.0    0.0     0.0    0.0
 CAF              Main                  <entire-module>             253           0    0.0    0.0   100.0   99.9
  $fShowBox13     Main                  <no location info>          263           1    0.0    0.0     0.0    0.0
  main2           Main                  <no location info>          257           1    0.0    0.0     0.0    0.0
  main3           Main                  <no location info>          258           1    0.0    0.0     0.0    0.0
   $w$cshowsPrec2 Main                  app/Main-optics.hs:16:26-29 259           1    0.0    0.0     0.0    0.0
  main4           Main                  <no location info>          260           1    0.0    0.0   100.0   99.9
   $wloop         Main                  <no location info>          261     1000001  100.0   99.9   100.0   99.9
  main5           Main                  <no location info>          262           1    0.0    0.0     0.0    0.0
 CAF              GHC.Conc.Signal       <entire-module>             248           0    0.0    0.0     0.0    0.0
 CAF              GHC.IO.Encoding       <entire-module>             238           0    0.0    0.0     0.0    0.0
 CAF              GHC.IO.Encoding.Iconv <entire-module>             236           0    0.0    0.0     0.0    0.0
 CAF              GHC.IO.Handle.FD      <entire-module>             228           0    0.0    0.1     0.0    0.1
arybczak commented 1 year ago

Ok, I didn't see this bit:

As others already stated this doesn't apply prof-late to dependencies which is a shortcomming of cabal. To work around this you can use the cabal.project trick mentioned further up.

Looks like a bug in cabal to me if profiling affects dependencies, but profiling-detail doesn't.

siraben commented 1 year ago

@arybczak the new changes help, thank you! Would you be able to make a release to Hackage that incorporates these changes?

arybczak commented 1 year ago

I will after #492 is merged.