yumeng5 / Spherical-Text-Embedding

[NeurIPS 2019] Spherical Text Embedding
Apache License 2.0

Segmentation Fault #8

Closed Sachin19 closed 4 years ago

Sachin19 commented 4 years ago

Hi,

I'm trying to train new embeddings with your code on a corpus of approximately 4B tokens, but the code gives me a segmentation fault right after reading the corpus and showing the number of tokens. I'm using ~200 GB of RAM. Do I need more memory, or could it be another issue? For reference, word2vec and fastText trained just fine on this corpus.

Thanks in advance!

yumeng5 commented 4 years ago

Hi,

Thanks for reporting this issue. I haven't tried running the code on a corpus with more than 4B tokens, so I can't estimate how much memory it will take (I apologize for not being able to try it right now, since I'm attending a conference). However, if memory were the problem, you should have received a memory allocation error rather than a segmentation fault.

My current best guess is that your corpus has too many documents/paragraphs. The maximum number of documents allowed is hard-coded at this line: https://github.com/yumeng5/Spherical-Text-Embedding/blob/b0f88207189373d0500208ddacf46aa9c2bbd9da/src/jose.c#L18. If your corpus has more documents than this number, the code will run into a segmentation fault. To fix it, change the value to a number larger than the number of lines in your corpus file (each line is one document/paragraph). Could you give it a try and see whether this solves your issue?
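To check whether a corpus would exceed the hard-coded limit before training, one option is to count its lines first. The following is a minimal Python sketch, not part of the repository: the function names `count_documents` and `fits_limit` are hypothetical, and `max_docs` stands for whatever value is set on the linked line of `jose.c`.

```python
def count_documents(corpus_path):
    """Count lines in the corpus file; each line is one document/paragraph."""
    # Binary mode avoids decoding errors, and iterating line by line
    # keeps a very large corpus out of memory.
    with open(corpus_path, "rb") as f:
        return sum(1 for _ in f)


def fits_limit(corpus_path, max_docs):
    """Return True if the document count is strictly below max_docs.

    max_docs is an assumption: pass the value hard-coded in jose.c.
    """
    return count_documents(corpus_path) < max_docs
```

If `fits_limit` returns `False`, the limit in `jose.c` would need to be raised above the reported line count and the code recompiled.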

Please let me know if you still encounter any errors or have other questions!

Best, Yu

daskol commented 4 years ago

@Sachin19 See related issue #6.

yumeng5 commented 4 years ago

Hi @Sachin19,

Thanks again for posting this issue. I was wondering if you got a chance to try my suggestions and could provide any update on this issue?

Thanks, Yu

Sachin19 commented 4 years ago

Hi Yu,

Thank you so much for your suggestion. Line 18 was exactly the problem, and increasing the maximum number of documents resolved it.

I was also wondering if you could point me to resources on how to implement Riemannian optimization in a package like PyTorch.

Thanks, Sachin

yumeng5 commented 4 years ago

Hi Sachin,

Thanks for letting me know! I'm glad it solved the issue.

Regarding Riemannian optimization, I'm not aware of existing PyTorch projects for the spherical space, but there are some for the hyperbolic space. For example, the Poincaré embedding codebase has a PyTorch implementation of Riemannian optimization in the Poincaré space. You could look specifically at the Poincaré manifold implementation, where the Riemannian gradient is implemented, as well as the RSGD implementation. Although the optimization formula will be different for the spherical space, I think that code could serve as a good reference and template.
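For orientation, a generic Riemannian SGD step on the unit sphere can be sketched in plain Python as below. This is a textbook-style illustration and not necessarily the update rule used in this repository or the paper: it projects the Euclidean gradient onto the tangent space at the current point, takes a step, and retracts to the sphere by renormalizing. All function names here are hypothetical.

```python
import math


def normalize(x):
    """Scale a vector to unit length (the retraction back onto the sphere)."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def riemannian_grad(x, euclid_grad):
    """Project the Euclidean gradient onto the tangent space at unit vector x."""
    dot = sum(a * g for a, g in zip(x, euclid_grad))
    # Remove the component of the gradient that points along x.
    return [g - dot * a for a, g in zip(x, euclid_grad)]


def rsgd_step(x, euclid_grad, lr):
    """One Riemannian SGD step: tangent-space step, then renormalize."""
    rg = riemannian_grad(x, euclid_grad)
    return normalize([a - lr * g for a, g in zip(x, rg)])


# Usage: minimize the loss -<x, target>, whose Euclidean gradient is -target.
target = normalize([0.0, 1.0, 1.0])
x = [1.0, 0.0, 0.0]
for _ in range(200):
    x = rsgd_step(x, [-t for t in target], lr=0.1)
```

In PyTorch, the same two operations (projection and renormalization) would be applied to the parameter tensor after `backward()` has produced the Euclidean gradient, for example inside a custom optimizer step.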

Please let me know if you have any other questions!

Best, Yu