xgi-org / xgi

CompleX Group Interactions (XGI) is a Python package for higher-order networks.
https://xgi.readthedocs.io
Other
178 stars 27 forks source link

flag_complex is slow because add_simplices_from is slow #221

Closed maximelucas closed 1 year ago

maximelucas commented 1 year ago

I have a pairwise dataset with 634 nodes and 18624 edges. I want to fill its empty triangles with probability p. We have a function flag_complex that does exactly this, but it's slow: after running an hour, I gave up.

It's slow because, to enforce simpliciality of a SimplicialComplex, the method add_simplices_from add its input hyperedges, but also all their subfaces down to pairwise edges. (Even more so, many of those subfaces might be shared between hyperedges, but we're not using that fact to our advantage.)

So, even pre-computing the triangles (with p=0.2), it's slow:

S = xgi.SimplicialComplex()
S.add_nodes_from(nodes) # instantaneous
S.add_simplices_from(edges) # takes a few minutes
S.add_simplices_from(triangles) # gave up after a few minutes

On the other hand, the Hypergraph class does not add all subfaces because it does not need to. So, this runs virtually instantaneously:

H = xgi.Hypergraph()
H.add_nodes_from(nodes)
H.add_edges_from(edges)
H.add_edges_from(triangles)

Of course this "solution" is less safe from the user point of view because it allows multiedges (not allowed in SimplicialComplex) and does not enforce simpliciality (SimplicialComplex does). So it is much more error prone.

My current temporary solution is to do the above and pre-compute the triangles to emulate the flag_complex up to order 2 as follows, given a networkx Graph G:

import random
def find_triangles(G):
    triangles = set(frozenset((n, nbr, nbr2)) for n in G for nbr, nbr2 in combinations(G[n],2) if nbr in G[nbr2])
    return [set(tri) for tri in triangles]

triangles_empty = find_triangles(G)
p = 0.2
triangles = [el for el in triangles_empty if random.random() <= p]

This runs in a few seconds on my dataset.

leotrs commented 1 year ago

Another problem is that add_simplices_from is recursive. As a rule of thumb, please don't implement recursive functions or methods in XGI. At least not when the recursion is expected to go more than a few times. Python is not optimized for deep recursive calls, and even without taking recursion into account, each new call to a function is more costly in python than it is in other languages.

In this case, instead of recursion I would keep a set of simplices that have yet to be added in memory and work through adding that set. Adding all subfaces within the same function call (i.e. no recursion) and keeping things in a set instead of a list (i.e. no repetitions) is surely going to improve things.

maximelucas commented 1 year ago

I agree with the recursivity, although we don't expect deep recursions here, triangles to edges is one level. It might be a problem when considering larger groups, but does not seem to be it for my case.

In this case, instead of recursion I would keep a set of simplices that have yet to be added in memory and work through adding that set. Adding all subfaces within the same function call (i.e. no recursion) and keeping things in a set instead of a list (i.e. no repetitions) is surely going to improve things.

Agreed and I'm writing a version like that now. But I'm not seeing much improvement surprisingly. Profiling, it looks like about 90% of the time is spent checking if a new simplex is not empty and if it exists already, with the line:

if not members or self.has_simplex(members):

which we do not do in hypergraphs. Our current version also spends 90% of the time in that line. I'll see what I can do.

leotrs commented 1 year ago
if not members or self.has_simplex(members):

Currently has_simplex is calling edges.members() which is going to make it v slow. Perhaps we should just use the _edges dict directly.

maximelucas commented 1 year ago

Good call. Preliminary tests with a very small SC of 5 simplices:

S = xgi.SimplicialComplex([[1, 2], [2, 3, 4]])
S._edge
# -> {0: frozenset({1, 2}),
#     1: frozenset({2, 3, 4}),
#     2: frozenset({2, 3}),
#     3: frozenset({2, 4}),
#     4: frozenset({3, 4})}

Accessing the members dict is 30x faster (and even more on bigger cases I think):

%timeit S.edges.members(dtype=dict)
# -> 968 ns ± 2.68 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
%timeit S._edge
# -> 29.6 ns ± 0.0699 ns per loop (mean ± std. dev. of 7 runs, 10,000,000 loops each)

Checking if (2,4) (one of the last simplices) exists is 5.3x faster:

%timeit set((2,4)) in (set(S.edges.members(s)) for s in S.edges)
# -> 2.18 µs ± 8.41 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

%timeit frozenset((2,4)) in (s for s in S._edge.values())
# -> 415 ns ± 1.72 ns per loop (mean ± std. dev. of 7 runs, 1,000,000 loops each)
maximelucas commented 1 year ago

Update: add_simplices_from is faster now. flag_complex is still slow though, but it's because it's adding too many times the same subfaces. Opening a new issue #264 and closing the current one.