I have a similar bug to yours

preesee commented 6 years ago

Hi there, I am also a machine learning researcher and I read your post on Stack Overflow: https://stackoverflow.com/questions/48558915/pytorch-nn-linear-layer-output-nan-on-well-formed-input-and-weights. I ran into a very similar problem when training my models. Here is my post: https://stackoverflow.com/questions/50740415/why-embedding-layer-returns-nan-after-some-iterations-training. I would like to know whether you got this issue resolved and, if so, how you fixed it. I don't have your contact details, so I raised an issue on your repository instead. Please get in touch if you have any suggestions for my issue. Thanks.

ssainz commented 6 years ago

I am also experiencing the same issue. Did either of you find a solution?

A reproduction of my bug: https://github.com/ssainz/pytorch_bug

My list of versions:

(venv) [sergio@machine pytorch_bug]$ pip list version
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
backports-abc (0.5)
backports.functools-lru-cache (1.5)
backports.shutil-get-terminal-size (1.0.0)
bleach (2.1.3)
certifi (2018.4.16)
chardet (3.0.4)
configparser (3.5.0)
cycler (0.10.0)
decorator (4.3.0)
entrypoints (0.2.3)
enum34 (1.1.6)
functools32 (3.2.3.post2)
future (0.16.0)
futures (3.2.0)
gym (0.10.5)
html5lib (1.0.1)
idna (2.6)
ipykernel (4.8.2)
ipython (5.7.0)
ipython-genutils (0.2.0)
ipywidgets (7.2.1)
Jinja2 (2.10)
jsonschema (2.6.0)
jupyter (1.0.0)
jupyter-client (5.2.3)
jupyter-console (5.2.0)
jupyter-core (4.4.0)
kiwisolver (1.0.1)
MarkupSafe (1.0)
matplotlib (2.2.2)
mistune (0.8.3)
nbconvert (5.3.1)
nbformat (4.4.0)
networkx (1.11)
notebook (5.5.0)
numpy (1.14.4)
pandocfilters (1.4.2)
pathlib2 (2.3.2)
pexpect (4.6.0)
pickleshare (0.7.4)
Pillow (5.1.0)
pip (9.0.1)
prompt-toolkit (1.0.15)
ptyprocess (0.5.2)
pyglet (1.3.2)
Pygments (2.2.0)
pyparsing (2.2.0)
python-dateutil (2.7.3)
pytz (2018.4)
pyzmq (17.0.0)
qtconsole (4.3.1)
requests (2.18.4)
scandir (1.7)
scipy (1.1.0)
Send2Trash (1.5.0)
setuptools (28.8.0)
simplegeneric (0.8.1)
singledispatch (3.4.0.3)
six (1.11.0)
subprocess32 (3.5.2)
terminado (0.8.1)
testpath (0.3.1)
torch (0.4.0)
torchvision (0.2.1)
tornado (5.0.2)
traitlets (4.3.2)
urllib3 (1.22)
wcwidth (0.1.7)
webencodings (0.5.1)
wheel (0.29.0)
widgetsnbextension (3.2.1)

zihualiu commented 6 years ago

@preesee @ssainz: I fixed the problem by rebuilding my virtual env. I think there may have been some sort of conflict between PyTorch's dependencies and some of my other libraries. However, I wasn't able to tell which one was the problem.

preesee commented 6 years ago

@ssainz @zihualiu I could not find the root cause of this issue either. After I upgraded PyTorch to version 0.4 (I previously had version 0.2 installed), the issue disappeared. However, some of my code written for version 0.2 is incompatible with the new version. At least PyTorch works now.

ssainz commented 6 years ago

@preesee @zihualiu - Thanks! I also fixed my problem: the network weights became NaN because I was backpropagating from a loss value that was already NaN. Details: https://discuss.pytorch.org/t/nn-linear-layer-output-nan-on-well-formed-input/19935
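
For anyone landing here later, the mechanism described above can be illustrated without PyTorch. The following is only a minimal sketch in plain Python, not the code from any of the linked posts: the scalar model, the `sgd_step` helper, the `skip_non_finite` flag and the sample data are all hypothetical names chosen for illustration. It shows that one NaN loss is enough to turn a weight into NaN permanently, and that skipping the update when the loss is not finite avoids that.

```python
import math

def sgd_step(weight, x, target, lr=0.1, skip_non_finite=True):
    """One SGD step for the scalar model y = weight * x with squared-error loss.

    Hypothetical helper for illustration only; it is not a PyTorch API.
    """
    pred = weight * x
    loss = (pred - target) ** 2
    if skip_non_finite and not math.isfinite(loss):
        # Skip the update so a NaN/inf loss cannot poison the weight.
        return weight, loss
    grad = 2.0 * (pred - target) * x  # d(loss)/d(weight)
    return weight - lr * grad, loss

# Made-up data: the second sample has a NaN input, so its loss is NaN.
samples = [(1.0, 2.0), (float("nan"), 1.0), (1.0, 2.0)]

unguarded = guarded = 0.5
for x, target in samples:
    unguarded, _ = sgd_step(unguarded, x, target, skip_non_finite=False)
    guarded, _ = sgd_step(guarded, x, target)

print(math.isnan(unguarded))   # True: the weight became NaN and stays NaN
print(math.isfinite(guarded))  # True: the bad step was skipped
```

In a real PyTorch training loop the analogous guard would presumably be a check such as `torch.isnan(loss)` before calling `loss.backward()`. Skipping the step only hides the symptom, though, so it is still worth finding out why the loss became NaN in the first place.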

zihualiu / pytorch_linear_bug

I have a similar bug to yours #1