xgfs / node2vec-c

node2vec implementation in C++
MIT License

Data set is too big to be held in one machine's memory, so I need to break it into small daily sets #5

Open · jackyhawk opened this issue 2 years ago

jackyhawk commented 2 years ago

Thanks for the excellent code.

I have one question: my data set is too big (it cannot be held in one machine's memory), so I need to break it into small daily sets. That means I should first generate each day's walk result (the sequences) and then train them as word2vec with other code (such as Gensim).

All I want is the random walk result.

As for the walk result, should I just return before the part shown below, and then save dw_rw to disk for later training? [screenshot of the relevant code in the training loop]
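For illustration, a minimal sketch of what "save dw_rw to disk" could look like, assuming dw_rw is a flat buffer of node IDs holding n_walks consecutive walks of walk_length steps each (the actual layout and ID type in the repo may differ):

```cpp
#include <cstdio>
#include <cstdint>
#include <cstddef>

// Hypothetical sketch, not the repo's exact API: dump a flat walk buffer
// to a text file, one walk per line, node IDs separated by spaces.
// Assumes dw_rw holds n_walks consecutive walks of walk_length IDs each;
// check the actual layout and ID type in the source before using.
void dump_walks(const uint32_t *dw_rw, size_t n_walks, size_t walk_length,
                const char *path) {
    FILE *f = std::fopen(path, "w");
    if (!f) { std::perror("fopen"); return; }
    for (size_t w = 0; w < n_walks; ++w) {
        const uint32_t *walk = dw_rw + w * walk_length;
        for (size_t i = 0; i < walk_length; ++i)
            std::fprintf(f, i ? " %u" : "%u", (unsigned)walk[i]);
        std::fputc('\n', f);
    }
    std::fclose(f);
}
```

One walk per line with whitespace-separated node IDs is also the format Gensim's LineSentence reader expects, so a file in this shape can be fed straight into Word2Vec for the later training step.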

xgfs commented 2 years ago

You will need to deal with the multiprocessing slightly better than I do in the training loop. One option would be to just run the random walk generation and write to the file from a single thread. As for the place, it is correct.
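A minimal sketch of that single-threaded option: each walk is streamed to disk as soon as it is generated, so only one walk is in memory at a time. The Graph layout and the uniform next-step choice below are placeholder assumptions; node2vec-c's real walker uses p/q-biased second-order sampling instead:

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstdint>
#include <vector>

// Placeholder graph: adjacency lists indexed by node ID.
using Graph = std::vector<std::vector<uint32_t>>;

// Stand-in for the repo's walker: a plain uniform random walk.
// The real node2vec walk uses p/q-biased second-order transitions.
static void sample_walk(const Graph &g, uint32_t start, size_t length,
                        std::vector<uint32_t> &out) {
    out.clear();
    uint32_t cur = start;
    out.push_back(cur);
    for (size_t i = 1; i < length; ++i) {
        const auto &nbrs = g[cur];
        if (nbrs.empty()) break;  // dead end: cut the walk short
        cur = nbrs[std::rand() % nbrs.size()];
        out.push_back(cur);
    }
}

// Generate all walks in one thread, appending each to the file right
// away, so memory stays bounded by the graph plus a single walk.
void generate_walks_to_file(const Graph &g, size_t walks_per_vertex,
                            size_t walk_length, const char *path) {
    FILE *f = std::fopen(path, "w");
    if (!f) { std::perror("fopen"); return; }
    std::vector<uint32_t> walk;
    for (size_t r = 0; r < walks_per_vertex; ++r)
        for (uint32_t v = 0; v < g.size(); ++v) {
            sample_walk(g, v, walk_length, walk);
            for (size_t i = 0; i < walk.size(); ++i)
                std::fprintf(f, i ? " %u" : "%u", (unsigned)walk[i]);
            std::fputc('\n', f);
        }
    std::fclose(f);
}
```

If the daily subgraphs are processed separately, the same routine can be run once per day, each run writing its own walks file.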

jackyhawk commented 2 years ago

Thanks very much. Is there any other repo available that can generate random walk sequences for big data sets? I found that with a data set bigger than 10 million edges, the memory required exceeds my machine's capacity (200 GB).
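For a rough sense of where the memory goes (illustrative arithmetic, not measurements from this repo): buffering every walk in RAM costs about n_vertices × walks_per_vertex × walk_length × sizeof(node_id) bytes, e.g. 10 M vertices × 10 walks × length 80 × 4 B ≈ 32 GB for the walks alone. Implementations that additionally precompute the second-order (p/q) transition tables per edge need memory that grows roughly with Σ_v deg(v)², which on dense graphs is usually what blows past a 200 GB budget; streaming walks to disk as sketched above keeps the walk side down to a single walk.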