wlin12 / wang2vec

Extension of the original word2vec using different architectures
Apache License 2.0

word2vec -negative-classes Segmentation fault #5

Open jetsnguns opened 8 years ago

jetsnguns commented 8 years ago

I'm trying to train a model using part-of-speech tags as word classes. When I supply even a very small word-classes file (~1000 lines), word2vec crashes with a segmentation fault. The same setup (a train file of 100 lines) but with no -negative-classes argument finishes just fine. Can anybody suggest how to debug this?

Attached: txt100.txt, nc100.txt

Exact command:

./word2vec -train txt100.txt -output model10.txt -hs 0 -size 20 -window 3 -type 3 -threads 1 -negative-classes nc100.txt

P.S. The text data and POS tags are taken from the Brown corpus.

felicialiu commented 8 years ago

I'm also getting a Segmentation fault. I'm trying to train cwindow vectors with a file of size 10^9 bytes (the English wikipedia dump).

./word2vec -train /path/to/trainingfile -output /save/path/file -type 2 -size 20 -window 1 -negative 0 -nce 10 -hs 0 -sample 1e-4 -threads 1 -binary 1 -iter 15 -cap 1

Starting training using file /path/to/trainingfile
Vocab size: 218317
Words in train file: 123353508
Segmentation fault

I've also tried -cap 0, 4 threads, and -negative instead of -nce.

changukshin commented 8 years ago

I have the same situation as @felicialiu.

My backtrace is:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffd7578700 (LWP 10260)]
0x0000000000409418 in TrainModelThread () at word2vec.c:928
928       for (c = 0; c < layer1_size; c++) syn1[c + l2 + window_offset] += g * syn0[c + l1];
(gdb) bt
#0  0x0000000000409418 in TrainModelThread () at word2vec.c:928
#1  0x00007ffff7943aa1 in start_thread (arg=0x7fffd7578700) at pthread_create.c:301
#2  0x00007ffff7690aad in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:115

wlin12 commented 8 years ago

Hi @jetsnguns,

I looked at nc100.txt, and the problem seems to be that some classes contain only a single word, for example:

BE be

So if we wish to predict the word "be", there are no other words in its class to use as negative samples. I would just remove those lines.
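If it helps, a minimal filter along these lines should do it (a sketch: it assumes one class per line in the form "CLASS word1 word2 ...", which is how I read nc100.txt, and drops classes with fewer than two member words):

#include <stdio.h>
#include <string.h>

int main(void) {
  char line[8192];
  while (fgets(line, sizeof line, stdin)) {
    /* Count whitespace-separated tokens on a scratch copy,
       leaving the original line intact for output. */
    char copy[8192];
    strcpy(copy, line);
    int tokens = 0;
    for (char *t = strtok(copy, " \t\r\n"); t; t = strtok(NULL, " \t\r\n"))
      tokens++;
    /* Keep the class only if it has at least two member words
       (class name plus two words = 3 tokens). */
    if (tokens >= 3) fputs(line, stdout);
  }
  return 0;
}

Compile it and run it as, e.g., ./filter_classes < nc100.txt > nc100.clean.txt.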

Wang Ling

wlin12 commented 8 years ago

Hi @papower1,

Thanks for sending me the backtrace. There is indeed a bug in that line: I was using the wrong set of parameters in the hierarchical softmax update. I have corrected it and checked that it works on text8, so can you pull and try again?
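For reference, here is a self-contained toy version of the hierarchical-softmax node update, modeled on the loop in the original word2vec.c (an illustrative sketch, not the actual patch). The point is the indexing: the output-side vector syn1 is addressed by the tree-node offset l2 alone, so adding an extra window offset there, as the crashing line at word2vec.c:928 does, can index past the end of syn1.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define LAYER1_SIZE 4   /* embedding size (toy value) */
#define TREE_NODES  3   /* inner nodes of a toy Huffman tree */
#define VOCAB_SIZE  4

int main(void) {
  float *syn0 = calloc((size_t)VOCAB_SIZE * LAYER1_SIZE, sizeof(float));
  float *syn1 = calloc((size_t)TREE_NODES * LAYER1_SIZE, sizeof(float));
  float neu1e[LAYER1_SIZE] = {0};  /* accumulated gradient for the input vector */
  float alpha = 0.025f;            /* learning rate */

  /* Toy Huffman path for one target word: inner-node indices and code bits. */
  int point[2] = {0, 2};
  int code[2]  = {0, 1};
  int codelen  = 2;

  long long l1 = 1 * LAYER1_SIZE;  /* offset of the context word's input vector */
  for (long long c = 0; c < LAYER1_SIZE; c++) syn0[c + l1] = 0.1f;

  for (int d = 0; d < codelen; d++) {
    long long l2 = (long long)point[d] * LAYER1_SIZE;  /* node offset: no window term */
    float f = 0;
    for (long long c = 0; c < LAYER1_SIZE; c++) f += syn0[c + l1] * syn1[c + l2];
    f = 1.0f / (1.0f + expf(-f));                      /* sigmoid */
    float g = (1 - code[d] - f) * alpha;               /* gradient times learning rate */
    for (long long c = 0; c < LAYER1_SIZE; c++) neu1e[c] += g * syn1[c + l2];
    for (long long c = 0; c < LAYER1_SIZE; c++) syn1[c + l2] += g * syn0[c + l1];
  }
  printf("neu1e[0] after the update: %f\n", neu1e[0]);
  free(syn0);
  free(syn1);
  return 0;
}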

Wang Ling

wlin12 commented 8 years ago

Hi @felicialiu,

I tried running your command with text8, and it seems to work. Do you get to the point where it prints the progress info? If it does not, it might be because the program is running out of memory for the parameters.
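As a rough sanity check, the parameter memory can be estimated from the numbers in your log (a back-of-the-envelope sketch; the assumption of two matrices, syn0 plus one output layer, depends on which of -hs/-negative/-nce are enabled, and the structured types also scale the output layer with the window size):

#include <stdio.h>

int main(void) {
  long long vocab_size  = 218317;  /* "Vocab size" from the log above */
  long long layer1_size = 20;      /* -size 20 */
  long long matrices    = 2;       /* assumption: syn0 plus one output matrix */
  double bytes = (double)vocab_size * layer1_size * sizeof(float) * matrices;
  printf("~%.1f MB for the parameter matrices\n", bytes / (1024.0 * 1024.0));
  return 0;
}

If the estimate is far below your available RAM, memory is probably not the culprit.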

Wang Ling

changukshin commented 8 years ago

@wlin12 btw, when I disable 'hs', it's good to go. Hope it helps.