Closed marius311 closed 2 years ago
Thank you so much @marius311, I agree with a lot of this!
> I never understood why there's a separate step where you spline the background. Why not just set up one single system of equations for `(H′, Θ′, Θᵖ′) = ...` and evolve the whole thing once? You could include reionization parameters in here too, since you're going to recode that?
Yes, this would definitely be more elegant. Julia is probably the only language where this would be efficient -- if I recall correctly, DifferentialEquations.jl attempts to automatically detect the sparsity pattern of the Jacobian (which is effectively what you do when you split up the background and perturbations). I'd want to benchmark this.
> Any reason not to use full Planck units and set G=1 as well?
It seems like there's some ambiguity if G = 1 -- I am using the unit system from NaturallyUnitful.jl.
Quick question as I'm getting further. Why did you choose to integrate the hierarchy as N_k separate N_ℓ-dimensional ODE solves rather than one single N_k-by-N_ℓ ODE solve? (I'd guess the latter is going to be much better performance)
> automatically detect the sparsity pattern of the Jacobian
I'm not sure you'd even need this tbh. You're right that solving the background quantities first is like assuming that the "upper triangular" part of the Jacobian over the full set of background+perturbation quantities is zero, but even if you don't tell it that, you'll still get the right answer, and I suspect the few extra wasted ops are no worse, and maybe even better, than the overhead of doing a spline fit in the first place. I may toy around with it; would take a bit of coding.
> It seems like there's some ambiguity
Maybe I'm missing something, but if I give you a number like 1.23 and tell you it has dimensions of length and is expressed in Planck units, you can unambiguously get its value in e.g. meters.
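Concretely, the conversion is unambiguous. A quick sketch using CODATA constants (the numbers here are just to illustrate the point, not anything from Bolt):

```julia
# Planck length ℓ_P = sqrt(ħG/c³); constants in SI units.
ħ = 1.054571817e-34   # J·s
G = 6.67430e-11       # m³ kg⁻¹ s⁻²
c = 2.99792458e8      # m/s
ℓP = sqrt(ħ * G / c^3)

x_planck = 1.23               # "a number like 1.23": length, in Planck units
x_meters = x_planck * ℓP      # unambiguous value in meters
@assert isapprox(ℓP, 1.616e-35; rtol = 1e-3)
```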
> Quick question as I'm getting further. Why did you choose to integrate the hierarchy as N_k separate N_ℓ-dimensional ODE solves rather than one single N_k-by-N_ℓ ODE solve? (I'd guess the latter is going to be much better performance)
If you simultaneously solve for all k, your adaptive timestep is limited by the worst timestep over all k systems. For example, the stiffness of the system varies with respect to both k and time, so if you used a stiffness-detecting multi-algorithm, you'd be forced to use a stiff solver across your whole system whenever any k is stiff. Of course, with TCA you can avoid the stiffest parts.
Here's a plot from one of Lesgourgues's talks *(image not included here)*.
I suspect the performance loss from this is modest (a factor of at most ~2), so perhaps it is worth it?
If we do pack everything into one ODE solve, this is excellent news for GPU deployment! We'd basically offload all the hard work to Rackauckas et al. and GPU matrix multiplication.
> Maybe I'm missing something, but if I give you a number like 1.23 and tell you it has dimensions of length and is expressed in Planck units, you can unambiguously get its value in e.g. meters.
Yeah, ok, it was also me being a bit lazy. Mason wrote that nice package with ħ = c = ϵ₀ = kb = 1 -- natural units, which are mostly used by particle physicists. Not quite the same as Planck units, ħ = c = G = kb = 1 (which are indeed probably the best for cosmology).
Let me write a quick package...
> I suspect the few extra wasted ops are no worse, and maybe even better, than the overhead of doing a spline fit in the first place. I may toy around with it; would take a bit of coding.
I'd support putting the background variables into the perturbation system for differentiability reasons as well -- splines from Interpolations.jl are not reverse-diffable.
> If you simultaneously solve for all k, your adaptive timestep is limited by the worst timestep over all k systems.
That's a good point. Here's the number of `hierarchy!` evaluations per k mode with the current defaults: *(plot omitted)*
So indeed, if they were all at the max, that's a factor of 2 slower. However, if the whole thing were vectorized, each iteration might be faster, and I also wonder whether the solvers we have available couldn't do something smart, so maybe you can win back that factor of 2 and more? It's not obvious to me that you can't, so I might give it a try. This will most likely be even more favorable on GPU.
Just thinking out loud, but looking at some benchmarks of the hierarchy solve (using this timing macro I just added):

```
 ───────────────────────────────────────────────────────────────────────────────────────────────
                                                           Time                   Allocations
                                                  ──────────────────────  ───────────────────────
                 Tot / % measured:                     8.27s / 96.7%           868MiB / 100%

 Section                                    ncalls     time   %tot     avg    alloc   %tot    avg
 ───────────────────────────────────────────────────────────────────────────────────────────────
 source_grid(…) (spectra.jl:5)                   1    8.00s   100%   8.00s   866MiB   100%  866MiB
 boltsolve(…) (perturbations.jl:19)            100    6.75s  84.4%  67.5ms   201MiB  23.2% 2.01MiB
 hierarchy!(…) (perturbations.jl:39)         2.66M    1.85s  23.2%   697ns    0.00B  0.00%   0.00B
 initial_conditions(…) (perturbations.jl:110)  100    161μs  0.00%  1.61μs  35.9KiB  0.00%    368B
 hierarchy!(…) (perturbations.jl:39)          200k    159ms  1.98%   794ns    0.00B  0.00%   0.00B
 source_function(…) (perturbations.jl:167)    200k    113ms  1.42%   566ns    0.00B  0.00%   0.00B
 ───────────────────────────────────────────────────────────────────────────────────────────────
```
A single call to `hierarchy!` is 697ns, and it's only ~20% of the full hierarchy solve; the rest is basically the overhead of the DiffEq solver. So switching to a single ODE might be quite good actually. The other possibility is using a `StaticArray` for `u` instead of an `Array` -- since it's so small, that might make some of those ops inside the solver faster. Or playing with the choice of solver and its options; not sure how much of that you've done.
That's such a great unicode timing macro! I didn't realize the ODE solver was so heavy, though it sort of makes sense since `hierarchy!` doesn't have that much going on. Very interested to see if the mega-ODE will be simultaneously more elegant and efficient.
I've spent some time considering true Planck units (ħ = c = G = kb = 1), and I even wrote a package. I don't think it's appropriate for numerics. Since we choose to normalize the gravitational potential perturbation to Φ = 1, the remaining unitful quantity is H (k also has units of H₀). You want a unit system where H₀ is close to 1, and unfortunately in Planck units H₀ ≈ 10⁻⁶¹. 1/H₀^5 ≈ 10³⁰⁵ is about the maximum size of a double (floatmax ≈ 1.8 × 10³⁰⁸), which is way too close to the edge.
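To make the headroom concrete, a quick Float64 check (a sketch; `H0` below is just the dimensionless magnitude 10⁻⁶¹):

```julia
H0 = 1e-61                    # H₀ in Planck units (dimensionless magnitude)
@assert isfinite(1 / H0^5)    # 1e305 still fits below floatmax(Float64) ≈ 1.8e308
@assert H0^6 == 0.0           # one more power of H₀ underflows to zero
```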
I think this is actually a facet of the cosmological constant problem! The vacuum energy density is way too small compared to what you get from dimensional analysis, so the Hubble constant expressed in those units is also absurdly small.
Anyway, I'll write a small Unitful-derived package where you can turn things with units into powers of Mpc, just like CLASS and CAMB.
> You want a unit system where H₀ is close to 1, and unfortunately in Planck units H₀ ≈ 10⁻⁶¹. 1/H₀^5 is about the maximum size of a double, which is way too close to the edge.
Is there anywhere in the code, though, where we would actually need to multiply by H₀^5 and then later divide by it? I can think of H₀^2 but not H₀^5. FWIW Cosmology.jl is all Planck units and I never ran into issues.
I've just registered UnitfulCosmo.jl to deal with the units. People can just convert their unit system to either Planck or Mpc units, and then convert them back at the end. Both should work with the current equations in Bolt. The package may have general applicability for other Julia cosmologists.
Quick update on some hacking I've done (all dead ends, ha, but I think worth mentioning):
I tried doing a single ODE solve over all k's, but you're right that you really need to include the sparsity of the resulting Jacobian (block-diagonal in k) to get anywhere near decent performance, and at that point it's actually not clear it's any better than just solving the different k's individually. I actually don't think this is the right path to GPU either, so overall I'd say dead end. https://github.com/xzackli/Bolt.jl/compare/main...marius311:megaode
I tried rewriting the ODEs entirely with StaticArrays. I was able to improve the performance of `hierarchy!` by about a factor of 2. However, as found above, `hierarchy!` is actually a small part of the runtime. Most of it boils down to `lu!` on the Jacobian, and `lu!(::StaticArray)` for our 34×34 matrix is actually ~50% slower than `lu!(::Array)`. At the very least, the fact that this is pretty cleanly possible is good, since it might be useful for a path to GPU. https://github.com/xzackli/Bolt.jl/compare/main...marius311:staticarrays
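For reference, the `lu!` hot spot is just LinearAlgebra's in-place dense LU on the Jacobian; a minimal sketch of that cost (the 34×34 matrix here is a random stand-in, not Bolt's actual Jacobian):

```julia
using LinearAlgebra

n = 34
J = rand(n, n) + n * I    # stand-in for the 34×34 Jacobian, made diagonally dominant
F = lu!(copy(J))          # in-place dense LU: the per-step cost inside the stiff solver
x = F \ ones(n)           # triangular solves, as in each Newton iteration
@assert J * x ≈ ones(n)
```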
Purely as code cleanup so you don't have to worry about indexing, I redid things with ComponentArrays. Sadly there's a ~2x performance hit, which kind of sucks since ComponentArrays were supposed to be overhead-free. For now I'd stay away from feeding them into the solver (though they can probably still be used internally to clean up the indexing). https://github.com/xzackli/Bolt.jl/compare/main...marius311:componentarrays
Something that does work:
Beyond that, I'm also curious to look into whether exploiting the sparsity pattern of the current Jacobian can help speed up the bulk of the computation (the matrix factorizations). In terms of `[Θ, Θᵖ, 𝒩, Φ, δ, v, δb, vb]`, it looks like this: *(sparsity-pattern image omitted)*

So fairly sparse, but 34×34 is small enough that the overhead of `SparseArrays` still makes factorization slower than on a dense array. So if there's something to do here, it doesn't appear to be as simple as that.
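The dense-vs-sparse tradeoff is easy to poke at with the stdlib alone; a sketch (the structured matrix here is a stand-in, not the real Jacobian):

```julia
using LinearAlgebra, SparseArrays

n = 34
D = Matrix(SymTridiagonal(fill(4.0, n), fill(1.0, n - 1)))  # structured test matrix
S = sparse(D)                                               # same matrix, sparse storage
b = ones(n)

xd = lu(D) \ b   # dense LAPACK LU
xs = lu(S) \ b   # sparse (UMFPACK) LU
@assert xd ≈ xs  # identical solutions; at 34×34 the dense path is typically faster
```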
Will continue hacking and hopefully collect some of the improvements into something that can be merged soon.
I'm kind of delighted to see someone as obsessed with perf as I am, heh.
Re: solvers, I only really tried the various Rosenbrock methods. I imagine a comprehensive solver performance comparison in the cost-error plane for these equations would be interesting and novel -- I believe Rackauckas said the BDF methods from Sundials would be most similar to CLASS's ndf15. CAMB uses DVERK, which I believe is just an adaptive-stepsize RK method.
Regarding the array hacking, these seem like pretty valuable finds, and I'm wondering what the best place to preserve this wisdom is. I guess this PR will be googleable?
It seems like it shouldn't be that hard to write a fast array type that just sweetens the syntax of what we're doing with the component vector. I'll write a short issue; seems like it might be a fun future weekend activity.
> I believe Rackauckas said the BDF methods from Sundials would be most similar to CLASS's ndf15. CAMB uses DVERK, which I believe is just an adaptive-stepsize RK method.
Correct me if I'm wrong, but we need a pure Julia solver so we can do autodiff (and GPU)? That would rule out Sundials. Of course, we have the advantage of DiffEq making things trivial to swap, so we could in theory use Sundials for CPU-non-AD cases, and seamlessly swap to something else for AD and/or GPU runs, with no change to the equations code which is pretty nice.
Also, what are you thinking re: TCA? I think it'd be pretty cool if, using some more sophisticated DiffEq solvers, we can get acceptable performance without needing it, making the code a whole lot easier to understand.
> Regarding the array hacking, these seem like pretty valuable finds, and I'm wondering what the best place to preserve this wisdom is. I guess this PR will be googleable?
I think just in those branches I posted above, that's why I pushed them up.
> It seems like it shouldn't be that hard to write a fast array type that just sweetens the syntax of what we're doing with the component vector. I'll write a short issue; seems like it might be a fun future weekend activity.
I'm planning to include that in this PR. I think the easiest is to just use ComponentArrays but wrap/unwrap them before feeding into the solver.
> Correct me if I'm wrong, but we need a pure Julia solver so we can do autodiff (and GPU)? That would rule out Sundials. Of course, we have the advantage of DiffEq making things trivial to swap, so we could in theory use Sundials for CPU-non-AD cases, and seamlessly swap to something else for AD and/or GPU runs, with no change to the equations code which is pretty nice.
For sure, we don't want any Sundials! Mostly I was ruminating on the choice of solver and how something like Rodas5 compares to the (very different) solvers used in CAMB and CLASS. You may have seen this very informative StackExchange post (by... you guessed it, Chris Rackauckas -- that guy is a huge nerd) comparing the stiff methods:
~~I'm not aware of a comparison in the literature of these methods for this problem domain.~~ One of the CLASS papers makes some comparisons between RK and NDF.
> Also what are you thinking re: TCA? I think it'd be pretty cool if using some more sophisticated DiffEq solvers we can get acceptable performance without needing it, making the code a whole lot easier to understand.
I implemented the 0th-order TCA in Callin06, which improved neither accuracy nor performance. It's a subject with a long history, so I think it deserves a little extra digging before we get rid of it. However, I'd love to not have any of that stuff in here.
Hey Zack, finally went through some of this in more detail like I said I would. I went up through `background.jl` and not any further, but plan to look at the rest too. Please take everything here as suggestion / topic for discussion; it's no problem if you don't like some or all of this, these are just my ideas for laying really solid groundwork. I imagine we could iterate on some of this, and eventually I'll include stuff in other files in this PR, then down the line merge something which may or may not resemble these initial comments at all. (And obviously I think this is an awesome project!) Some categories of comments:

**Probably too ambitious:**
- I never understood why there's a separate step where you spline the background. Why not just set up one single system of equations for `(H′, Θ′, Θᵖ′) = ...` and evolve the whole thing once? You could include reionization parameters in here too, since you're going to recode that?

**Physical:**

**Code:**
- I would rename `AbstractCosmoParams` -> `Params`, since that is what gets written a lot. Concrete types can then be called `ΛCDMParams`, etc. (I did see you made the alias `ACP`, but this reads more clearly I'd say.) Probably similar for `AbstractBackground`.
- I would have in mind a definite way you want people to add extended models from the absolute get-go, to guide your design and to avoid the CAMB/CLASS problem of 10 forks that will never work with each other. Here's how I'd propose it works. Say you just want to add some extra background component with some overall amplitude. You'd do … (`@extends` is from StructExtender.jl, which I actually think is a very good solution to doing this in Julia, and I plan to register/support it since I'm going to be using it in CMBLensing.jl.) There are other things you could think of doing, including using traits, hardcore abstract-type + getter-function usage, etc., but I think the above is a nice combination of simple and extensible. Very open to discussion on this.
- I would follow the loose Julia convention that arguments you dispatch on go first, so I would put `par` first in general. Also, I would name it `𝕡` to help it stick out from other stuff. See here for how it looks.
- For code succinctness, I'd get rid of underscores in variable names unless there's a whole word after, like `ρ_crit`.
- Note: there's `\pprime` instead of two `\prime`s (also `\ppprime` and `\pppprime`).
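Since the `@extends` example above was elided from this transcript and StructExtender.jl's exact syntax isn't shown here, a plain-Julia sketch of the same extend-by-composition idea (all names hypothetical, not Bolt's actual API):

```julia
# All names here are hypothetical; this mimics what an @extends-style macro
# would generate: a new params type reusing the ΛCDM fields plus one amplitude.
Base.@kwdef struct ΛCDMParams
    h::Float64   = 0.7
    Ω_b::Float64 = 0.045
end

Base.@kwdef struct ExtraComponentParams
    lcdm::ΛCDMParams = ΛCDMParams()
    A_x::Float64     = 0.1    # amplitude of the extra background component
end

p = ExtraComponentParams(A_x = 0.2)
@assert p.lcdm.h == 0.7 && p.A_x == 0.2
```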