wwiecek / baggr

R package for Bayesian meta-analysis models, using Stan
GNU General Public License v3.0

stan code suggestions #136

Open bob-carpenter opened 2 years ago

bob-carpenter commented 2 years ago

Someone asked me to take a look at this repo and I have some suggestions for improving Stan code efficiency.

baggr/inst/stan/functions/prior_increment.stan

I'd suggest rewriting it as follows for efficiency and clarity:

real prior_lpdf(real theta, int family, vector par) {
  if (family == 0) return uniform_lupdf(theta | par[1], par[2]);
  else if (family == 1) return normal_lupdf(theta | par[1], par[2]);
  ...
  else return student_t_lupdf(theta | par[1], par[2], par[3]);
}

and then call the prior in code as:

theta ~ prior(family, par);

Using lupdf instead of lpdf drops the normalizing constants. The direct return also cuts out the remaining conditional evaluations, which are much more costly than arithmetic on modern hardware. At the very least, your code should use else if, even if you stick with the assign-and-single-return style for some reason. The rewrite is also more self-documenting: it makes clear that the function (a) defines a probability density and (b) dispatches on family.
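If you do keep the assign-and-single-return style, the minimal fix is the else if chain. A sketch of that variant, reusing the family/par encoding from above (only a few families shown):

real prior_lpdf(real theta, int family, vector par) {
  real lp;
  if (family == 0) lp = uniform_lupdf(theta | par[1], par[2]);
  else if (family == 1) lp = normal_lupdf(theta | par[1], par[2]);
  else lp = student_t_lupdf(theta | par[1], par[2], par[3]);
  return lp;
}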

baggr/inst/stan/logit.stan

Transformed data can be compressed to a one-liner:

int K_pooled = pooling_type == 2 ? 0 : K;

and the declaration in parameters can be simplified to

real mu[pooling_type != 0];
real tau[pooling_type == 1];

It turns out that conditional == 1 ? 1 : 0 reduces to conditional when conditional can only take on the values 0 or 1 (as with Stan's conditionals). You can also turn the transformed parameters into one-liners that make all three cases clear:

vector[K_pooled] theta_k
  = (pooling_type == 0)
  ? eta
  : (pooling_type == 1)
  ? rep_vector(mu[1] + tau[1] * eta, K_pooled)
  : []';

I also moved the arithmetic inside so that it's done once before replicating rather than K_pooled times afterward.

Then in the model block,

vector[N] fe = Nc == 0 ? rep_vector(0, N) : X * beta;

You want to replace the two sequential, mutually exclusive ifs with an else if on the second one, both to avoid an extra conditional evaluation and to make it clear that it's a choice between the two branches.

You should vectorize prior_increment_real. Calling it in a loop is costly compared with rolling the vectorization into the function itself (i.e., taking eta as a vector parameter).
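A minimal sketch of what that could look like, reusing the family/par encoding from prior_lpdf above (the name prior_vec_lpdf is illustrative, not from the package); Stan's density functions already accept vector arguments and return the sum of the element log densities:

real prior_vec_lpdf(vector theta, int family, vector par) {
  if (family == 0) return uniform_lupdf(theta | par[1], par[2]);
  else if (family == 1) return normal_lupdf(theta | par[1], par[2]);
  else return student_t_lupdf(theta | par[1], par[2], par[3]);
}

Then the per-component loop in the model collapses to a single statement such as eta ~ prior_vec(family, par);.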

I'd be strongly inclined to refactor so that the choice of pooling type and Nc is rolled into the prior. It's hard to follow as written with all the conditionals. Otherwise, I'd urge you to use three top-level conditionals, even if it means some code duplication.

if (pooling_baseline == 0) {
  ...
} else if (pooling_baseline == 1) {
  ...
}

if (pooling_type == 0) {
   ...
} else if (pooling_type == 1) {
   ...
} else {
   ...
}

if (Nc > 0) { ... }

In the generated quantities block, I'd urge you to break out the definitions of the three variables and rearrange the loops to make it clear in what order things are defined and under what conditions they're defined. For example, logpd isn't defined if pooling_type == 0. Is that the same condition as K_test > 0?

vector[N_test] fe_test
  = (Nc == 0) ? rep_vector(0, N_test) : X_test * beta;

vector[pooling_type == 1 ? K_test : 0] theta_k_test;
if (pooling_type == 1)
  theta_k_test = to_vector(normal_rng(rep_vector(mu[1], K_test), tau[1]));

vector[K_test > 0] logpd;
if (pooling_type == 1) {
  logpd[1] = bernoulli_logit_lpmf(test_y | baseline_k[test_site] + theta_k_test[test_site] .* test_treatment + fe_test);
} else if (pooling_type == 2) {
  logpd[1] = bernoulli_logit_lpmf(test_y | baseline_k[test_site] + mu[1] * test_treatment + fe_test);
}

This makes it clear locally, for example, that the definition of theta_k_test is sound in that it is size 0 or filled in to size K_test. Also, what happens if pooling_type == 0 and K_test > 0? It looks like logpd will be undefined. I turned everything into a vector so that we could use vector arithmetic in the definitions. I also vectorized the bernoulli_logit calls.

The other programs could use similar changes, so I'm only going to repeat novel recommendations.

baggr/inst/stan/mutau.stan

I'd reduce the definition of theta_k to a one-liner, moving the arithmetic inside the rep_matrix:

theta_k[1] = pooling_type == 0 ? eta[1] : rep_matrix(tau[1] * eta[1] * (cumsum == 1 ? mu[1] : cumulative_sum(mu[1])), K);

But I'm unclear on why, if cumsum == 1, you don't use cumulative sums (lines 58--62 of the original code).

This is the line that could hose performance:

for(k in 1:K)
      eta[1][,k] ~ multi_normal(prior_hypermean_mean, prior_hypermean_scale);

This is causing prior_hypermean_scale to be factored for every k on every iteration. What you want to do is (a) use a Cholesky-based representation, and (b) collect all those values from 1:K into a single array to feed once into multi_normal_cholesky. That way, the O(P^3) factorization happens once, in transformed data, and each density evaluation is an O(P^2) solve against the precomputed factor, instead of K O(P^3) factorizations every iteration. The way to do this is to Cholesky-factor in transformed data, reshape the variable eta, and then use a single vectorized call.

transformed data {
  cholesky_factor_cov[P] L_prior_hypermean_scale = cholesky_decompose(prior_hypermean_scale);

parameters {
  array[pooling_type != 2, K] vector[P] eta;

model {
  eta[1] ~ multi_normal_cholesky(prior_hypermean_mean, L_prior_hypermean_scale);

This'll be a huge savings for even modest size K.

In the model, you want to turn all those sampling statements into vectorized ones. So it'd look like this:

  to_vector(theta_hat_k) ~ normal(to_vector(theta_k[1]), to_vector(se_theta_k));

It may seem like creating the vectors adds cost, but it lets the autodiff expression tree be compressed. And you really want to move that to_vector(theta_hat_k) into the transformed data block, since theta_hat_k is data and only needs to be flattened once.
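A minimal sketch of that move, assuming theta_hat_k and se_theta_k are declared in the data block (the _vec names are illustrative):

transformed data {
  // flatten once, outside sampling; these are data, so this costs nothing per iteration
  vector[num_elements(theta_hat_k)] theta_hat_vec = to_vector(theta_hat_k);
  vector[num_elements(se_theta_k)] se_theta_vec = to_vector(se_theta_k);
}

model {
  theta_hat_vec ~ normal(to_vector(theta_k[1]), se_theta_vec);
}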

All the same comments as in the last model apply elsewhere. There's a lot that can be tightened up here if you care about speed, and the other models are just more of the same.

wwiecek commented 2 years ago

@bob-carpenter thanks so much again for these suggestions. Just wanted to let you know that some of them have been implemented in devel (see #140). They were incredibly helpful and I cut the runtime by up to 50% by implementing them.

I will use lupdf eventually, but for now my understanding is that rstan (currently 2.21.5 on CRAN) does not support it. The revision of the mutau file you commented on will be in the next minor release, as that whole model needs rewriting for other reasons. I am keeping this issue open until then.

bob-carpenter commented 2 years ago

I would strongly suggest moving to cmdstanr if you don't need direct access to log densities or gradients. It keeps up to date with Stan. CRAN has basically blocked further Stan releases with a combination of their small max package size and lack of dependency management.

wwiecek commented 2 years ago

Thanks! I have been using it for my ad hoc work but haven't looked into how it interacts with R packages that contain compiled code. I will do so soon, so will keep this issue open until then.