wiheto / teneto

Temporal Network Tools
GNU General Public License v3.0

Memory error with large network - centralities #52

Closed alberto-bracci closed 4 years ago

alberto-bracci commented 4 years ago

Hi,

I am just starting with Teneto, installed with pip on Anaconda (Windows). I am trying to load the temporal network from here

I put a line "i,j,t" at the beginning of the file, then loaded it with pandas as a dataframe and used: teneto.TemporalNetwork(from_df=dataframe) but I receive a memory error:

MemoryError Traceback (most recent call last)

in
----> 1 tnet2 = tnet.TemporalNetwork(from_edgelist=[list(d) for d in D.values])

C:\ProgramData\Anaconda3\lib\site-packages\teneto\classes\network.py in __init__(self, N, T, nettype, from_df, from_array, from_dict, from_edgelist, timetype, diagonal, timeunit, desc, starttime, nodelabels, timelabels, hdf5, hdf5path, forcesparse)
    131 self.network_from_df(from_df)
    132 if from_edgelist is not None:
--> 133 self.network_from_edgelist(from_edgelist)
    134 elif from_array is not None:
    135 self.network_from_array(from_array, forcesparse=forcesparse)

C:\ProgramData\Anaconda3\lib\site-packages\teneto\classes\network.py in network_from_edgelist(self, edgelist)
    257 colnames = ['i', 'j', 't']
    258 self.network = pd.DataFrame(edgelist, columns=colnames)
--> 259 self._update_network()
    260
    261 def network_from_dict(self, contact):

C:\ProgramData\Anaconda3\lib\site-packages\teneto\classes\network.py in _update_network(self)
    220 """
    221 self._calc_netshape()
--> 222 self._set_nettype()
    223 if self.nettype:
    224 if self.nettype[1] == 'u':

C:\ProgramData\Anaconda3\lib\site-packages\teneto\classes\network.py in _set_nettype(self)
    172 self.nettype = 'xu'
    173 G1 = teneto.utils.df_to_array(
--> 174 self.network, self.netshape, self.nettype)
    175 self.nettype = 'xd'
    176 G2 = teneto.utils.df_to_array(

C:\ProgramData\Anaconda3\lib\site-packages\teneto\utils\utils.py in df_to_array(df, netshape, nettype)
    749 if len(df) > 0:
    750 idx = np.array(list(map(list, df.values)))
--> 751 G = np.zeros([netshape[0], netshape[0], netshape[1]])
    752 if idx.shape[1] == 3:
    753 if nettype[-1] == 'u':

MemoryError:

Am I doing something wrong or maybe this representation cannot handle large networks?

wiheto commented 4 years ago

Try adding the argument forcedense=False. At the moment it is trying to create a numpy array for your network. This will make sure an HDF5 representation is created and should be fine.

If that doesn't work, could you tell me how big your network is?

alberto-bracci commented 4 years ago

It seems that argument is not present, at least in my version. I tried 'forcesparse=True' and also 'hdf5=True' without success. What's more, the error is the same, which I wouldn't expect in the latter case as it should use a different format. The network has around 900 nodes and 33720 time-stamped links.

wiheto commented 4 years ago

Sorry, I meant forcesparse=True (I was sitting on a train and didn't double-check the argument). The HDF5 compatibility was never completed or optimized, as it was also slowing down processing on smaller networks. It is on my todo list to fix all this in December, when I have time to contribute to this instead of other projects. So bear with me here. There may be one or two errors, but we can probably get them sorted quite easily when they arise.

But this problem seems to come from the function that tries to figure out what type of network your input is: it tries to turn the dataframe into a numpy array to determine this (not optimal). So if you add the argument nettype='bu' (or 'bd', 'wu', 'wd', depending on whether your network is binary/weighted and undirected/directed), this function shouldn't be called.
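To illustrate why the type check does not in principle need a dense array, here is a minimal sketch that guesses the same four codes directly from a list of contacts. This is not teneto's implementation; `infer_nettype` is a hypothetical helper, and it assumes contacts are `(i, j, t)` or `(i, j, t, weight)` tuples and that an undirected network lists every edge in both directions.

```python
def infer_nettype(contacts):
    """Guess a nettype code ('bu', 'bd', 'wu', 'wd') from a list of contacts.

    Hypothetical helper, not part of teneto. Each contact is (i, j, t)
    or (i, j, t, weight).
    """
    contacts = [tuple(c) for c in contacts]
    # Weighted if any contact carries a weight other than 1.
    weighted = any(len(c) == 4 and c[3] != 1 for c in contacts)
    # Map each directed (i, j, t) to its weight (1 for unweighted rows).
    weights = {c[:3]: (c[3] if len(c) == 4 else 1) for c in contacts}
    # Undirected if every contact has a mirrored contact with the same weight.
    undirected = all(
        weights.get((j, i, t)) == w for (i, j, t), w in weights.items()
    )
    return ("w" if weighted else "b") + ("u" if undirected else "d")
```

An edge list that stores each undirected edge only once would be classified as directed by this check, which is one more reason to pass nettype explicitly when you already know it.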

That is slightly bigger than most of the networks I usually use (ca 500 nodes and 1000 time points). But the HDF5 representation should work.
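For a rough sense of scale, the dense array in the traceback, np.zeros([netshape[0], netshape[0], netshape[1]]), holds N x N x T float64 values at 8 bytes each. The sketch below does that arithmetic. The 900 nodes come from this thread, but the number of time points is never stated, so both values of T used here are assumptions (33720 would be the upper bound where every link has its own time stamp).

```python
def dense_array_bytes(n_nodes, n_timepoints, itemsize=8):
    """Bytes needed for a dense array of shape (N, N, T).

    itemsize=8 matches float64, the default dtype of np.zeros.
    """
    return n_nodes * n_nodes * n_timepoints * itemsize


# N=900 is from the thread; the T values are assumptions for illustration.
for n_timepoints in (1_000, 33_720):
    gib = dense_array_bytes(900, n_timepoints) / 2**30
    print(f"N=900, T={n_timepoints}: {gib:.1f} GiB")
```

The traceback also shows the type check calling df_to_array twice (G1 and G2), so the peak requirement during that step may be higher than a single array.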

alberto-bracci commented 4 years ago

This indeed worked! Setting the type also seemingly makes 'forcesparse' or HDF5 unnecessary. Quick unrelated question (to avoid opening another issue): is it possible to have references for the centrality measures implemented in the library? For example, the formula or the article they refer to.

Thanks for your quick help! Alberto

wiheto commented 4 years ago

Which centrality measure in particular are you after?

I generally follow Masuda and Lambiotte's book "A guide to temporal networks" for the maths for many of the measures. Adding citations to all the docstrings is also on the todo list. Some of them already have quite detailed information in the documentation (e.g. here), but I've not had time to write one for every measure yet.

So if there are any you want me to find, I can find them for you, add them to the docstrings, and provide the references here too.

alberto-bracci commented 4 years ago

I was mainly interested in the centrality measures for now, so closeness, betweenness and degree are the ones missing. I am asking because I found different definitions in different papers, and as of now I am not able to get a copy of the book to look for them myself. Really appreciate your help here!

wiheto commented 4 years ago

Alright. I have some writing time assigned later today, so I'll add them then. Within 24 hours I'll have the documentation for all three of those. And, especially for closeness and betweenness, I'll add to the documentation of shortest temporal paths as well (as that is the place I've seen the most differences in equations).

wiheto commented 4 years ago

You may want to update from the developer branch: https://github.com/wiheto/teneto/tree/develop as some argument names are changing in the upcoming 0.5.0, so the documentation isn't fully in line with the functions in 0.4.6

The more in depth documentation is here:

https://teneto.readthedocs.io/en/develop/networkmeasures/temporal_closeness_centrality.html#module-teneto.networkmeasures.temporal_closeness_centrality

https://teneto.readthedocs.io/en/develop/networkmeasures/temporal_degree_centrality.html#module-teneto.networkmeasures.temporal_degree_centrality

As with a lot of teneto's documentation, I write far too quickly to get doc coverage, and sometimes lose clarity. Just open an issue whenever anything is unclear.

2 changes still to make.

So the shortest temporal paths function is HDF5 ready but the calculation of closeness centrality is not. It is an easy fix, but I want to test it tomorrow to make sure it works. But since you will need the shortest temporal paths for both betweenness centrality and closeness, you may as well precompute them first anyway and save them.

I didn't get round to betweenness centrality docs. I'll also try and do that tomorrow.

wiheto commented 4 years ago

https://teneto.readthedocs.io/en/develop/api/teneto.networkmeasures.temporal_betweenness_centrality.html#teneto.networkmeasures.temporal_betweenness_centrality

I've also updated the normalization to follow the reference mentioned before, for 0.5.0. Previously it did not divide by sigma_jk. I need to write a test to make sure this is working as expected (today or tomorrow).
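For readers unfamiliar with the notation, the division by sigma_jk appears in the standard (static-network) definition of betweenness centrality, shown below as a reminder. The temporal version in the linked documentation may add time indices and a further normalisation factor, so treat this as the generic form rather than teneto's exact formula.

```latex
B(i) = \sum_{\substack{j \neq k \\ j \neq i,\; k \neq i}} \frac{\sigma_{jk}(i)}{\sigma_{jk}}
```

Here \sigma_{jk} is the number of shortest paths from j to k and \sigma_{jk}(i) is the number of those paths that pass through i. Without the division, a node pair with many equally short paths would contribute more than a pair with a single shortest path.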

Otherwise, can I close this issue now? Seems like the problems are sorted.

alberto-bracci commented 4 years ago

Yes, everything should be fine. Just a question: how quickly do you expect the shortest path function to run? I tried it with a network of around 90 nodes and 300 links and after 6 hours it hadn't finished yet (Core i7 on a laptop).

Also, it is better to first compute the shortest paths and then use them as an argument for closeness and betweenness, right?

alberto-bracci commented 4 years ago

Also, there might be another issue with the shortest path function: Whereas with a 'bd' network the behavior is as described above, the same network but loaded as 'bu' returns the following error:

File "", line 1, in shortest_paths = tnt.networkmeasures.shortest_temporal_path(t)

File "C:\ProgramData\Anaconda3\lib\site-packages\teneto\networkmeasures\shortest_temporal_path.py", line 201, in shortest_temporal_path network = tnet.get_network_when(ij=list(ij), t=t)

File "C:\ProgramData\Anaconda3\lib\site-packages\teneto\classes\network.py", line 483, in get_network_when return teneto.utils.get_network_when(self, **kwargs)

File "C:\ProgramData\Anaconda3\lib\site-packages\teneto\utils\utils.py", line 993, in get_network_when network['j'].isin(ij))), (network['t'].isin(t)))]

TypeError: and_ expected 2 arguments, got 1

wiheto commented 4 years ago

> Yes, everything should be fine. Just a question: how quick you expect the shortest path function to be? I tried it with a network of around 90 nodes and 300 links and after 6 hours it wasn't finished yet (core i7 on laptop).

So when making the HDF5-compatible objects I compromised on speed. This is the major backbone issue regarding speed that has to be solved, and it is planned for the end of December (the start of #36 is relevant here).

> Also, it is better to first compute the shortest paths and then use them as argument for closeness and betweenness right?

Yes, because otherwise you have to calculate the paths twice, and that is the most computationally intensive part.
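The compute-once pattern can be sketched as follows. Everything here is illustrative: the `paths` table stands in for a precomputed shortest-path result, and `closeness_like` and `betweenness_like` are deliberately simplified toy measures, not teneto's definitions or function names.

```python
def closeness_like(paths, nodes):
    """Toy closeness: mean inverse hop count from each node to the others."""
    n = len(nodes)
    out = {}
    for i in nodes:
        # One term per stored path starting at i; unreachable targets add 0.
        inv = [1 / (len(p) - 1) for (src, _dst), p in paths.items() if src == i]
        out[i] = sum(inv) / (n - 1)
    return out


def betweenness_like(paths, nodes):
    """Toy betweenness: number of stored paths that pass through each node."""
    out = {v: 0 for v in nodes}
    for p in paths.values():
        for v in p[1:-1]:  # interior nodes only, endpoints excluded
            out[v] += 1
    return out


# Hypothetical precomputed table: (source, target) -> node sequence of one path.
paths = {(0, 1): [0, 1], (0, 2): [0, 1, 2], (1, 2): [1, 2]}
nodes = [0, 1, 2]

# Both measures read the same table, so the expensive step runs only once.
closeness = closeness_like(paths, nodes)
betweenness = betweenness_like(paths, nodes)
```

Saving the path table to disk after computing it, as suggested earlier in the thread, extends the same idea across sessions.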

Regarding the error. Interesting. I'm going to open up a new issue about that as that is about undirected HDF5 network referencing.

wiheto commented 4 years ago

Also, regarding the speed of shortest_temporal_path: to reduce the possible path space, you could change the value of steps_per_t.

The default value of steps_per_t in shortest_temporal_path is 'all'. This means that, at each time-point, a path can travel across multiple nodes. This is not a reasonable assumption in many temporal networks. If you set this parameter to an integer (e.g. 1, meaning that a path can traverse only one edge per time-point), it will speed up the calculation.
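The effect of limiting the number of edges per time-point can be seen in a small earliest-arrival sketch. This is not teneto's algorithm: `earliest_arrival` is a hypothetical function that only borrows the `steps_per_t` name, and it assumes directed `(i, j, t)` contacts and a source that is available from the first time-point.

```python
from collections import defaultdict


def earliest_arrival(contacts, source, steps_per_t="all"):
    """Earliest time-point at which each node can be reached from `source`.

    contacts: iterable of directed (i, j, t) tuples.
    steps_per_t: "all", or a positive int giving the maximum number of
    edges a path may traverse within a single time-point.
    """
    edges_at = defaultdict(list)
    for i, j, t in contacts:
        edges_at[t].append((i, j))
    if not edges_at:
        return {}

    times = sorted(edges_at)
    arrival = {source: times[0]}  # the source is present from the first time-point
    for t in times:
        frontier = set(arrival)  # nodes that can extend a path at time t
        hops = 0
        while steps_per_t == "all" or hops < steps_per_t:
            new = {j for i, j in edges_at[t] if i in frontier and j not in arrival}
            if not new:
                break
            for j in new:
                arrival[j] = t
            frontier = new  # only newly reached nodes can add another hop
            hops += 1
    return arrival


contacts = [(0, 1, 0), (1, 2, 0), (1, 2, 1), (2, 3, 2)]
unlimited = earliest_arrival(contacts, source=0)
one_step = earliest_arrival(contacts, source=0, steps_per_t=1)
```

With unlimited steps, node 2 is reached at time-point 0 via two edges in the same time-point. With steps_per_t=1 it is reached only at time-point 1, and each time-point needs a single pass over its edges.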

wiheto commented 4 years ago

Another possible way to speed it up at the moment is to set the i argument and run it in parallel (so for 90 nodes you can run 90 jobs at once, but this will require access to a cluster).
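The one-job-per-source idea might look like the sketch below. It uses a thread pool purely to show the decomposition; `reachable_from` is a hypothetical stand-in for the real per-source path computation, and for CPU-bound pure-Python work a process pool or separate cluster jobs would be needed to get an actual speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def reachable_from(source, contacts):
    """Stand-in per-source job: nodes reachable along time-respecting paths.

    Contacts are directed (i, j, t); ties in t are processed in input order.
    """
    reached = {source}
    for i, j, _t in sorted(contacts, key=lambda c: c[2]):
        if i in reached:
            reached.add(j)
    return reached


def all_sources(contacts, nodes, max_workers=4):
    """Run one independent job per source node and collect results by source."""
    nodes = list(nodes)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda s: reachable_from(s, contacts), nodes)
        return dict(zip(nodes, results))


contacts = [(0, 1, 0), (1, 2, 1), (2, 3, 2)]
by_source = all_sources(contacts, range(4))
```

Because each source is independent, the per-source results can be written to separate files and merged afterwards, which is what makes the cluster approach straightforward.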

wiheto commented 4 years ago

Aside from the computational time, I think all the issues here have been solved, so I'm closing this issue.