zpneal / backbone

R backbone package - Extract the backbone from weighted and unweighted networks
https://www.rbackbone.net

Memory usage excessive? #16

Closed deklanw closed 4 years ago

deklanw commented 4 years ago

I have 13.3k nodes in one mode, 38k in another, and 1.1m edges. Trying to run:

sdsm <- sdsm(bg)
sdsm_bb <- backbone.extract(sdsm, signed = FALSE, alpha = 0.1, fwer = "bonferroni") 
sdsm_bb

Running this in wsl2 with

memory=20GB 
swap=50GB

I'm running out of memory, I assume:

Finding the distribution using SDSM with polytope model.

Error in cvxCanon$build_matrix(lin_vec, id_to_col_C): std::bad_alloc
Traceback:

1. sdsm(bg)
2. as.vector(polytope(B))
3. polytope(B)
4. suppressWarnings(CVXR::psolve(problem))
5. withCallingHandlers(expr, warning = function(w) if (inherits(w, 
 .     classes)) tryInvokeRestart("muffleWarning"))
6. CVXR::psolve(problem)
7. CVXR::psolve(problem)
8. perform(object@.solving_chain, object@.intermediate_problem)
9. perform(object@.solving_chain, object@.intermediate_problem)
10. perform(r, problem)
11. perform(r, problem)
12. stuffed_objective(object, problem, extractor)
13. stuffed_objective(object, problem, extractor)
14. affine(extractor, expr(problem@objective))
15. affine(extractor, expr(problem@objective))
16. affine(object, list(expr))
17. affine(object, list(expr))
18. get_problem_matrix(op_list, object@id_map)
19. cvxCanon$build_matrix(lin_vec, id_to_col_C)

I know this is a fairly large graph. Is it just inherently too memory intensive?

Thanks for any help

zpneal commented 4 years ago

Thanks. This occurs because the default sdsm model (polytope) relies on the CVXR convex optimization solver, which can be memory intensive. The polytope model is generally the best choice when it is computationally feasible, but when it's not, you can try one of the other sdsm model types. For example, sdsm <- sdsm(bg, model = "chi2") will use a much faster and more memory-efficient model that, in some preliminary testing, often yields results nearly as good as polytope.
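
For reference, a minimal sketch of that workaround applied to the workflow from your first post (reusing the same bg object and backbone.extract() settings shown there):

  library(backbone)
  sdsm_chi2 <- sdsm(bg, model = "chi2")   # avoids the CVXR-based polytope solver
  bb <- backbone.extract(sdsm_chi2, signed = FALSE, alpha = 0.1, fwer = "bonferroni")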

deklanw commented 4 years ago

I see, thanks!

Just trying some stuff out with a smaller version of this graph with 305 nodes on one side (and 38k on the other), I'm getting

scobit: 3677 edges
chi2: 1634 edges

Examining the graphs, I notice they're significantly different. The former has only 54% of nodes in the giant component but is far denser and more ball-like. The latter has 89% of nodes in the giant component and is better separated into distinct clusters. Polytope, even for this smaller graph, seemingly isn't terminating.

Just from this, it seems this choice makes a significant difference. In your paper "The backbone of bipartite projections: Inferring relationships from co-authorship, co-sponsorship, co-attendance and other co-behaviors" you seem to recommend the scobit model. But here you recommend chi2 (and, indeed, I prefer the chi2 results).

Do you have any general tips or intuition about which models to try (short of just trying all which terminate and comparing)?

Thanks for any help

zpneal commented 4 years ago

It's a good question, and an issue we're actively investigating. But, I can offer at least a few preliminary observations:

As far as a recommendation goes: trying everything and comparing is one option, but that seems a bit unsatisfying and unscientific. I'd suggest using either sdsm(model="chi2") or fdsm(), since both should give results pretty close to what you'd get from sdsm(model="polytope") if it were feasible. Although fdsm() will be slow given your network's size, it should still run.

I hope that's helpful. But keep me posted if other questions come up.

deklanw commented 4 years ago

Thanks, that helps

"The author of the library suggests chi2 works the best for memory-constrained environments" is almost scientific enough for me :P

If the regression methods are variable, where does curveball fit into this? I just gave curveball a try and it yields 240 edges with 35% of nodes in the giant component, so very, very sparse.

I was wondering whether fdsm might be slower time-wise but lighter memory-wise. You seem to be saying that, yes? Unfortunately, fdsm isn't working for me, see: https://github.com/domagal9/backbone/issues/17

zpneal commented 4 years ago

LOL...I'm sure there's a citation format for referencing github threads.

When curveball is used in sdsm -- sdsm(model="curveball", trials = 1000) -- it uses numerical simulation to achieve the same goal that polytope achieves via convex optimization, and that chi2 achieves approximately. We're working now on paring down the options available in sdsm, and these are the top candidates for retention.

fdsm() should be very memory-light, since it only needs one matrix with the dimensions of your original bipartite matrix (B), and a couple of matrices with the dimensions of the bipartite projection. The slowness comes from needing to repeatedly compute B%*%t(B). I did see issue #17, but haven't had a chance to see why this is happening. It should be a minor coding error, since there's nothing specific to fdsm that would cause that problem.
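
To give a sense of where that cost comes from, here's a rough sketch of the kind of resampling loop involved (my own illustration, not the package's actual code; randomize_degrees() is a hypothetical stand-in for a fixed-degree-sequence randomization such as curveball swaps):

  fdsm_sketch <- function(B, trials = 1000) {
    observed <- B %*% t(B)                 # observed co-occurrence counts
    exceed <- matrix(0, nrow(B), nrow(B))  # counts how often a null value meets/exceeds the observed one
    for (i in seq_len(trials)) {
      B_star <- randomize_degrees(B)       # hypothetical fixed-degree randomization of B
      exceed <- exceed + ((B_star %*% t(B_star)) >= observed)  # the repeated, costly product
    }
    exceed / trials                        # estimated upper-tail p-values
  }

Each trial redoes the full projection product, so run time grows with both the number of trials and the size of B.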

deklanw commented 4 years ago

Thanks, that all makes sense to me.

I had a bug in how I was generating my bipartite graphs, so my results are a bit different now. For all of these I'm using

signed = FALSE, alpha = 0.005, fwer = "holm", and for fdsm I'm using 2000 trials (overkill?).

Looking at results, of 305 total nodes, here are stats on the giant components:

chi2: 293 nodes, 1475 edges
fdsm: 305 nodes, 5562 edges
scobit: 183 nodes, 3692 edges

Subjectively, the scobit (still) has too few nodes in the giant component, and is way too ball-like. The chi2 and the fdsm both seem subjectively pretty reasonable, despite the latter having almost 4 times as many edges! Why such a discrepancy?

I was assuming that fdsm is just the best but this part from your paper made me question that:

Second, the FDSM fixes the agent and artifact degrees of the random bipartite graphs at exactly the values in the observed bipartite network, but offers no conceptual rationale for such stringent conditioning. The logic underlying the conditional uniform graph test is that each random bipartite graph represents an alternate possible realization of agent-artifact linkages. The FDSM views it as possible for agents to have been linked to different artifacts, but not possible for agents to have been linked to different numbers of artifacts. However, this seems to go too far. For example, although a legislator may be observed to have sponsored 100 bills, in an equally plausible alternate world driven by the same political forces, the same legislator might have sponsored 99 bills, or 101 bills. Thus, even if uniformly randomly sampling from fixed-degree random bipartite networks were easy, doing so risks over-conditioning or imposing too many assumptions on the null model.

Would you say that we ought to prefer polytope or chi2 to fdsm?

Also, I took a look at the polytope code and I have a few questions. I'm new to both R (learning it to use this library) and convex optimization.

Why are we maximizing the entropy of the matrix and its complement? Why not just the matrix alone?

  objective <- CVXR::Maximize(sum(CVXR::entr(matrix)+CVXR::entr(1-matrix)))

Why do we need to restrict values manually if we already listed this as a constraint?

  ### Restrict values between 0 and 1 ###
  gr <- which(new_matrix>1)
  new_matrix[gr] <- 1
  le <- which(new_matrix<0)
  new_matrix[le] <- 0

Also, is it possible that one of the alternative solvers for CVXR wouldn't use so much memory? E.g., the commercial solvers like Mosek or Gurobi? They have academic and trial licenses. I might give it a shot if I can figure it out.

Also, is there an option for keeping the significance of an edge to use as an edge weight? Why do we throw that information away by default?

zpneal commented 4 years ago

Lots of good questions here.

(1) Is 2000 trials overkill for FDSM? Actually, 2000 is probably way too few. For a big bipartite matrix, it could take several orders of magnitude more to obtain sufficiently smooth distributions and good results. If T is the number of trials, then the maximum precision you can achieve on a p-value is 1/T, so in this case you could observe p-values as small as p = 0.0005. But because you've specified alpha = 0.005 with fwer = "holm", p-values that coarse probably aren't small enough to survive the correction.

(2) Which backbone (chi2, fdsm, scobit) is right? Not sure, and it partly depends on what "right" means. SDSM was designed to provide a fast approximation of FDSM. sdsm(model="polytope") should come the closest, but that remains to be verified. The relative performance of the other SDSM model types is currently unknown, although chi2 seems promising. A few different things may be driving what you're seeing here. First, scobit estimation is often unstable; it's included here to allow replication of Neal (2014), but may not always be a good option. Second, your fdsm solution may be unstable due to an insufficient number of trials. Third, all the backbones will be influenced by the choice of a fairly low alpha in combination with a Holm-Bonferroni correction.

(3) Would you say that we ought to prefer polytope or chi2 to fdsm? Not necessarily. FDSM is a more restrictive null model, which should make it more statistically powerful, but also makes it slower to run. It's reasonable to think of SDSM as a relaxed form of FDSM. When it is feasible, polytope is preferred to chi2, but it is unknown how much the two differ in practice.

(4) Why are we maximizing the entropy of the matrix and its complement? Why not just the matrix alone? Why do we need to restrict values manually if we already listed this as a constraint? I'll leave the question about entropy maximization to my co-developer, Rachel Domagalski. The manual restriction just corrects values that fall slightly outside [0,1] due to precision issues. Requiring a higher-precision solution from CVXR would also fix this, but takes too long.
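
Incidentally, that clamping can be written in one line (a sketch equivalent to the which()-based lines you quoted):

  # Clamp any out-of-range solver output back into [0,1]
  new_matrix <- pmin(pmax(new_matrix, 0), 1)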

(5) Is it possible that one of the alternative solvers for CVXR wouldn't use so much memory? It is possible. We haven't implemented any commercial solvers because there isn't a straightforward way for R users to install them from the command line.

(6) Is there an option for keeping the significance of an edge to use as an edge weight? There are two reasons we discard it by default. The simple reason is that backbone extraction methods are designed to transform a weighted graph (in this case, a bipartite projection) into an unweighted graph, so the goal is to discard information, but smartly. The more philosophical reason is that these backbone extraction methods rest on a frequentist interpretation of null hypothesis significance testing (NHST), in which one pre-specifies a significance level (here, alpha = 0.005) and uses p-values to make binary significant/not-significant decisions. From this perspective, it doesn't make sense to think of an effect as being more or less significant.

deklanw commented 4 years ago

Thank you for your response

I've found something very relevant here. Not sure if you already know all of this, but it's new to me. There is already a well-developed way of computing a max-entropy null model subject to constraints: the exponential random graph model. The problem of constraining the degree sequence for unipartite graphs is called the configuration model. The analogous problem for bipartite graphs is called the Bipartite Configuration Model (BiCM).

See "Inferring monopartite projections of bipartite networks: an entropy-based approach" https://iopscience.iop.org/article/10.1088/1367-2630/aa6b38

To compute the null model for a specific graph, you just do a maximum likelihood optimization, solving for Lagrange multipliers. If you have an m x n bipartite matrix, that turns out to be finding the root of an R^(m+n) -> R^(m+n) nonlinear function. Once you have those parameters, computing the probabilities for the whole matrix is easy. I'm pretty sure this is much easier than the convex optimization formulation you have. Here is a summary of the method: https://i.imgur.com/qNxLkcg.png
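
To make that concrete, here's a rough sketch of those likelihood conditions as a fixed-point iteration (my own illustration, not code from the packages below): for an m x n biadjacency matrix B, solve for multipliers x (one per row node) and y (one per column node) so that the expected degrees sum_j x_i*y_j/(1 + x_i*y_j) match the observed row sums (and likewise for columns); the edge probabilities are then p_ij = x_i*y_j/(1 + x_i*y_j).

  bicm_probs <- function(B, iters = 5000, tol = 1e-10) {
    k <- rowSums(B)                  # observed degrees of row nodes
    h <- colSums(B)                  # observed degrees of column nodes
    x <- k / sqrt(sum(B))            # rough starting values
    y <- h / sqrt(sum(B))
    for (it in seq_len(iters)) {
      inv <- 1 / (1 + outer(x, y))             # 1 / (1 + x_i*y_j)
      x_new <- as.vector(k / (inv %*% y))      # fixed-point update for x
      y_new <- as.vector(h / (t(inv) %*% x))   # fixed-point update for y
      delta <- max(abs(x_new - x), abs(y_new - y))
      x <- x_new; y <- y_new
      if (delta < tol) break
    }
    xy <- outer(x, y)
    xy / (1 + xy)                    # matrix of independent edge probabilities
  }

Either way, there are only m + n unknowns, rather than the full m x n matrix that the convex optimization works over.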

There are already (obscure, alpha) Python packages for doing most of this: https://github.com/tsakim/bicm and https://github.com/mat701/BiCM

I tried the latter with the S114 graph, and the results look very good. I tried to make a fair comparison with your polytope method. I believe this package does a single-tailed test, so I used alpha=0.025 for it and alpha=0.05 with signed=FALSE for your library. The package also hard-codes some kind of multiple testing correction, which I think is Bonferroni? (https://github.com/mat701/BiCM/blob/master/bicm/functions.py#L177-L186), so I used 'bonferroni' for yours. Polytope took about a minute to finish and BiCM took a few seconds.

The results are pretty similar: 1302 edges for polytope, 1336 for BiCM. But, the BiCM is one giant component and the polytope isn't.

Here is a visual comparison: polytope: https://i.imgur.com/AcNJwsb.png BiCM: https://i.imgur.com/TOthpxD.png

So, if I'm understanding correctly, it seems like your sdsm corresponds to a canonical ensemble and your fdsm corresponds to a microcanonical ensemble. Above I asked which is theoretically better; well:

Remarkably, for models of networks with an extensive number of constraints, the microcanonical and canonical ensembles turn out to be inequivalent in the thermodynamic limit N→∞[33–35]. This is in contrast to the case of traditional statistical physics (except possibly at phase transitions), in which there is typically a finite number of constraints, such as total energy and total number of particles. In this sense, complex networks not only provide an important domain of application for statistical physics but can also expand our fundamental understanding of statistical physics itself and lead to new theoretical insights. Additionally, from a practical point of view, the breaking of ensemble equivalence in networks implies that the choice between microcanonical and canonical ensembles cannot be based solely on mathematical convenience, as is usually done, but rather should follow from a principled criterion. In particular, because the canonical ensemble describes systems subject to statistical fluctuations, it is the more appropriate ensemble to use when the observed values of the constraints can be affected by measurement errors, missing and spurious data or simply stochastic noise. Fortunately, as we shall see, the canonical ensemble is more analytically tractable than the microcanonical ensemble.

from "The statistical physics of real-world networks".

Here are the citations for that

33. Anand, K. & Bianconi, G. Entropy measures for networks: toward an information theory of complex topologies. Phys. Rev. E 80, 045102 (2009).
34. Squartini, T., de Mol, J., den Hollander, F. & Garlaschelli, D. Breaking of ensemble equivalence in networks. Phys. Rev. Lett. 115, 268701 (2015).
35. Squartini, T. & Garlaschelli, D. Reconnecting statistical physics and combinatorics beyond ensemble equivalence. Preprint at https://arxiv.org/abs/1710.11422 (2018).

I'm going to be testing larger graphs out to continue comparing, etc.

The more philosophical reason is that these backbone extraction methods rest on a frequentist interpretation of null hypothesis significance testing (NHST), in which one pre-specifies a significance level (here, alpha = 0.005) and uses p-values to make binary significant/not-significant decisions. From this perspective, it doesn't make sense to think of an effect as being more or less significant.

But more or less significant does make sense, yes? "Significance strength" makes sense to me as an edge weight.

zpneal commented 4 years ago

Thanks for the detailed suggestion. We've had the BiCM from Saracco et al. on our agenda for a while, but it seems like now is a good time to get started on implementing it.

Re: significance strength - the p-values from these models indicate the probability of observing an edge weight as strong or stronger if edge weights were generated by the null model. That certainly contains some important information, but it's not necessarily clear that this information communicates something about the edge's "strength." Either way, it is preserved in the output of the backbone package's functions. hyperg(), sdsm(), and fdsm() all yield an object X of class(X) = "backbone", which is a list containing matrices of upper-tail and lower-tail p-values, along with summary details. So, rather than sending this object to backbone.extract, you could directly examine, for example, X$positive, treating it like a weighted network.
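
For instance, a quick sketch of that idea (reusing the bg object and the extraction settings from earlier in this thread; the 1 - p weighting is just one illustrative choice):

  library(backbone)
  bb <- sdsm(bg, model = "chi2")     # returns an object of class "backbone"
  p_upper <- bb$positive             # matrix of upper-tail p-values
  w <- 1 - p_upper                   # one crude "significance strength" edge weight
  diag(w) <- 0
  # ...or still make binary keep/drop decisions at a chosen alpha:
  net <- backbone.extract(bb, signed = FALSE, alpha = 0.05, fwer = "holm")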

deklanw commented 4 years ago

Looking forward to the BiCM!

The p-value stuff makes sense. It seems that keeping the p-values on the edges doesn't make sense in general but may be useful in some cases.

Since the original issue has been well answered by now ("polytope is inherently memory-expensive because it's using a general convex optimization solver"), I'll close it.

Thanks for the help!