zhongkaifu / Seq2SeqSharp

Seq2SeqSharp is a tensor-based, fast and flexible deep neural network framework written in .NET (C#). It has many highlighted features, such as automatic differentiation, different network types (Transformer, LSTM, BiLSTM and so on), multi-GPU support, cross-platform support (Windows, Linux, x86, x64, ARM), multimodal models for text and images, and more.

Endless computation with CUDA #5

Closed · usptact closed this issue 5 years ago

usptact commented 5 years ago

I installed CUDA 10.0 and built the ManagedCUDA (x64, Release) DLLs.

Before I started training, I copied several files into the same directory as "Seq2SeqConsole.exe".

I start training with this command: /Seq2SeqConsole.exe -TaskName train -WordVectorSize 50 -HiddenSize 50 -LearningRate 0.1 -ModelFilePath alarm.model -SrcVocab data_vocab.source -TgtVocab data_vocab.target -SrcLang en -TgtLang lf -TrainCorpusPath ~/Downloads -ArchType 0 -Depth 1

The training starts and prints the following:

info,2/11/2019 11:17:18 AM Command Line = '-TaskName train -WordVectorSize 50 -HiddenSize 50 -LearningRate 0.1 -ModelFilePath alarm.model -SrcVocab data_vocab.source -TgtVocab data_vocab.target -SrcLang en -TgtLang lf -TrainCorpusPath C:/Users/vlad/Downloads -ArchType 0 -Depth 1'
info,2/11/2019 11:17:18 AM Source Language = 'en'
info,2/11/2019 11:17:18 AM Target Language = 'lf'
info,2/11/2019 11:17:18 AM SSE Enable = 'True'
info,2/11/2019 11:17:18 AM SSE Size = '256'
info,2/11/2019 11:17:18 AM Processor counter = '8'
info,2/11/2019 11:17:18 AM Hidden Size = '50'
info,2/11/2019 11:17:18 AM Word Vector Size = '50'
info,2/11/2019 11:17:18 AM Learning Rate = '0.1'
info,2/11/2019 11:17:18 AM Network Layer = '1'
info,2/11/2019 11:17:18 AM Gradient Clip = '5'
info,2/11/2019 11:17:18 AM Dropout Ratio = '0.1'
info,2/11/2019 11:17:18 AM Batch Size = '1'
info,2/11/2019 11:17:18 AM Arch Type = 'GPU_CUDA'
info,2/11/2019 11:17:18 AM Device Ids = '0'
info,2/11/2019 11:17:18 AM Loading model from 'alarm.model'...
info,2/11/2019 11:17:18 AM Initialize device '0'
Precompiling GatherScatterKernels
Precompiling Im2ColKernels
Precompiling IndexSelectKernels
Precompiling ReduceDimIndexKernels
Precompiling CudaReduceAllKernels
Precompiling CudaReduceKernels
Precompiling ElementwiseKernels
Precompiling FillCopyKernels
Precompiling SoftmaxKernels
Precompiling SpatialMaxPoolKernels
Precompiling VarStdKernels
info,2/11/2019 11:24:37 AM Loading model from 'alarm.model'...
info,2/11/2019 11:24:37 AM Initializing weights...
info,2/11/2019 11:24:37 AM Initializing weights for device '0'
info,2/11/2019 11:24:37 AM Initializing encoders and decoders for device '0'...
info,2/11/2019 11:24:37 AM Start to train...
info,2/11/2019 11:24:37 AM Shuffling training corpus...
info,2/11/2019 11:24:37 AM Base learning rate is '0.1' at epoch '0'
info,2/11/2019 11:24:37 AM Cleaning cache of weights optmiazation.'
info,2/11/2019 11:24:37 AM Start to process training corpus.
info,2/11/2019 11:24:37 AM Shuffling training corpus...

Then it gets stuck (the precompiling steps also took a while). One CPU core is loaded at 100% and 29 GB (!) of RAM is in use. My system has 64 GB of RAM, so RAM does not appear to be the issue.

Do you have any idea what is going on? Maybe I missed some important step?

Note: Training using CPU only works just fine.

Thank you

zhongkaifu commented 5 years ago

The first run usually takes a long time for initialization; it then creates cuda_cache to speed up subsequent runs. Have you already tried it with ManagedCUDA 8.0?

In my environment, I installed the driver and libraries for CUDA 10.0 but still use ManagedCUDA 8.0. It works fine for me.
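
To illustrate the caching behavior zhongkaifu describes, here is a minimal sketch of the idea, assuming a hash-keyed cuda_cache directory; it is not Seq2SeqSharp's actual code, and the class and parameter names are hypothetical. The expensive NVRTC compilation only runs when no cached binary exists for a given kernel source:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

static class KernelCache
{
    static readonly string CacheDir =
        Path.Combine(AppContext.BaseDirectory, "cuda_cache");

    // 'compile' stands in for the real NVRTC call; it only runs on a cache miss.
    public static byte[] GetOrCompile(string kernelSource, Func<string, byte[]> compile)
    {
        Directory.CreateDirectory(CacheDir);

        // Key the cache entry on a hash of the kernel source code.
        string key;
        using (var sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(kernelSource));
            key = BitConverter.ToString(hash).Replace("-", "");
        }
        string cachePath = Path.Combine(CacheDir, key + ".ptx");

        if (File.Exists(cachePath))
            return File.ReadAllBytes(cachePath);   // fast path: reuse the cached binary

        byte[] compiled = compile(kernelSource);   // slow, CPU-bound path: first run only
        File.WriteAllBytes(cachePath, compiled);
        return compiled;
    }
}
```

This is why a warm second run skips the long "Precompiling ..." phase: the compiled kernels are simply read back from disk.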

usptact commented 5 years ago

Yes, I also tried with the original ManagedCUDA 8.0 (the one provided in the repo). Same behavior. I tried on two different Windows 10 machines. I had to manually copy a bunch of DLLs into the runtime directory. Is this what you did as well?

It looks like the environment variables (mentioned in the previous, now-closed issue) are not actually respected by CUDA or the application.

If somebody checks out your repo, what steps does one need to take to run training? In my case, I clone the repo, build Seq2SeqConsole.exe in VS2017, and then try to run it. I get a missing-DLLs error. I wonder whether I am doing the setup incorrectly.

zhongkaifu commented 5 years ago

Yes, you need to copy the DLLs below into your current directory before running it. They are distributed in the NVIDIA CUDA package, so Seq2SeqSharp doesn't include them. "*" stands for the CUDA version.

cublas64_*.dll, nvrtc64_*.dll, nvrtc-builtins64_*.dll
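
As a quick sanity check, one could verify these files are present next to the executable before training starts. The sketch below is hypothetical (not part of Seq2SeqSharp), and the concrete file names are the CUDA 10.0 variants of the patterns above:

```csharp
using System;
using System.IO;
using System.Linq;

static class CudaDllCheck
{
    public static void Verify()
    {
        string exeDir = AppContext.BaseDirectory;

        // CUDA 10.0 file names; other CUDA versions use different suffixes.
        string[] required =
        {
            "cublas64_100.dll",         // cuBLAS
            "nvrtc64_100_0.dll",        // NVRTC runtime compiler
            "nvrtc-builtins64_100.dll"  // NVRTC built-in headers
        };

        var missing = required.Where(f => !File.Exists(Path.Combine(exeDir, f)))
                              .ToList();
        if (missing.Count > 0)
            throw new FileNotFoundException(
                "Missing CUDA DLLs next to the executable: " + string.Join(", ", missing));
    }
}
```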

Do you still get stuck during training even when you use ManagedCUDA 8.0?

usptact commented 5 years ago

Thank you for confirming. I copied the DLLs you mentioned into the runtime directory (where the .exe is).

Yes, unfortunately the issue remains. I tried with ManagedCUDA 8.0 first (a few days ago). Today, I tried on a different machine using ManagedCUDA 10.0 (checked out and built from its GitHub repo).

A single CPU core is loaded at 100% the whole time. RAM usage fluctuates (slowly) but stays around 12-18 GB. The computation seems to be stuck at epoch 0. No GPU load is showing (I am using the GPU-Z utility to monitor it). It is not clear what the CPU is doing all that time.
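
As an alternative to GPU-Z, GPU load can be polled programmatically through nvidia-smi's query interface. A minimal sketch, assuming nvidia-smi is in its default Windows install location (the helper class itself is hypothetical; the --query-gpu flags are standard nvidia-smi options):

```csharp
using System;
using System.Diagnostics;

static class GpuMonitor
{
    public static string QueryUtilization()
    {
        var psi = new ProcessStartInfo
        {
            FileName = @"C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe",
            Arguments = "--query-gpu=utilization.gpu,memory.used --format=csv,noheader",
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using (var p = Process.Start(psi))
        {
            string output = p.StandardOutput.ReadToEnd();
            p.WaitForExit();
            return output.Trim();   // e.g. "0 %, 768 MiB"
        }
    }
}
```

A steady "0 %" here while the process pins one CPU core would confirm that the work is still happening on the CPU, not the GPU.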

zhongkaifu commented 5 years ago

What's your GPU type and model? Can you run "nvidia-smi" and paste the output here?

usptact commented 5 years ago

I am running an NVIDIA Quadro M2000M card in this laptop; this is the device I tried ManagedCUDA 8.0 on. I am not sure how to get "nvidia-smi" output in Windows 10.

The desktop computer (where I ran ManagedCUDA 10.0 today) has an older NVIDIA GTX 1070. It runs Windows 10 as well.

zhongkaifu commented 5 years ago

"nvidia-smi" is usually located at "C:\Program Files\NVIDIA Corporation\NVSMI"

My dev environment is "NVIDIA GTX 1070 + Windows 10" as well, and it works. Could you please 1) make a clean build (using ManagedCUDA 8.0 as the dependency), 2) copy the DLLs above to the current directory, and 3) retry?

usptact commented 5 years ago

@zhongkaifu I really appreciate your patience in guiding me through troubleshooting! I will get back to you once I am at my computer later today.

zhongkaifu commented 5 years ago

I just made some new changes, such as upgrading the default ManagedCUDA dependency to version 10.0. I've tested it on several different GPU types, such as GTX 1070/P40/K40m. Could you please check out the latest source code, build, and retry?

usptact commented 5 years ago

Thanks for the updates to the code! I will check it out! I am currently working on a Linux PC, so no VS2017 access for now...

usptact commented 5 years ago

Here's the output of "nvidia-smi.exe":

$ ./nvidia-smi.exe
Wed Feb 13 21:58:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 417.35       Driver Version: 417.35       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070   WDDM  | 00000000:01:00.0  On |                  N/A |
|  0%   41C    P8    11W / 151W |    768MiB /  8192MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1844    C+G   ...o\2017\Community\Common7\IDE\devenv.exe N/A      |
|    0      3416    C+G   Insufficient Permissions                   N/A      |
|    0      7140    C+G   ...anizer\Elements Auto Creations 2019.exe N/A      |
|    0      8924    C+G   ...ost.CLR.x86\ServiceHub.Host.CLR.x86.exe N/A      |
|    0      9740    C+G   ...8.138.0_x64__kzf8qxf38zg5c\SkypeApp.exe N/A      |
|    0     10056    C+G   ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A      |
|    0     14084    C+G   Insufficient Permissions                   N/A      |
|    0     14420    C+G   C:\Windows\explorer.exe                    N/A      |
|    0     16880    C+G   ...6)\Google\Chrome\Application\chrome.exe N/A      |
|    0     16996    C+G   ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A      |
+-----------------------------------------------------------------------------+
usptact commented 5 years ago

I am running the same command on the same data:

Seq2SeqConsole.exe -TaskName train -WordVectorSize 50 -HiddenSize 50 -LearningRate 0.1 -ModelFilePath alarm.model -SrcVocab data_vocab.source -TgtVocab data_vocab.target -SrcLang en -TgtLang lf -TrainCorpusPath ~/Downloads -ArchType 0 -Depth 1

The current computation is stuck at:

info,2/13/2019 10:00:42 PM Command Line = '-TaskName train -WordVectorSize 50 -HiddenSize 50 -LearningRate 0.1 -ModelFilePath alarm.model -SrcVocab data_vocab.source -TgtVocab data_vocab.target -SrcLang en -TgtLang lf -TrainCorpusPath C:/Users/vlad/Downloads -ArchType 0 -Depth 1'
info,2/13/2019 10:00:42 PM Source Language = 'en'
info,2/13/2019 10:00:42 PM Target Language = 'lf'
info,2/13/2019 10:00:42 PM SSE Enable = 'True'
info,2/13/2019 10:00:42 PM SSE Size = '256'
info,2/13/2019 10:00:42 PM Processor counter = '8'
info,2/13/2019 10:00:42 PM Hidden Size = '50'
info,2/13/2019 10:00:42 PM Word Vector Size = '50'
info,2/13/2019 10:00:42 PM Learning Rate = '0.1'
info,2/13/2019 10:00:42 PM Network Layer = '1'
info,2/13/2019 10:00:42 PM Gradient Clip = '5'
info,2/13/2019 10:00:42 PM Dropout Ratio = '0.1'
info,2/13/2019 10:00:42 PM Batch Size = '1'
info,2/13/2019 10:00:42 PM Arch Type = 'GPU_CUDA'
info,2/13/2019 10:00:42 PM Device Ids = '0'
info,2/13/2019 10:00:42 PM Initialize device '0'
Precompiling GatherScatterKernels
Precompiling IndexSelectKernels
Precompiling ReduceDimIndexKernels
Precompiling CudaReduceAllKernels
Precompiling CudaReduceKernels
Precompiling ElementwiseKernels
zhongkaifu commented 5 years ago

On the first run, "Precompiling ElementwiseKernels" takes a long time to precompile the element-wise kernels, which are then kept in the cache folder. It takes about 15 minutes on my devbox. How about yours? How long did you wait?

usptact commented 5 years ago

On my box it took a couple of minutes. Right now I am waiting at the very same spot as before:

info,2/13/2019 10:17:47 PM Command Line = '-TaskName train -WordVectorSize 50 -HiddenSize 50 -LearningRate 0.1 -ModelFilePath alarm.model -SrcLang src -TgtLang tgt -TrainCorpusPath C:/Users/vlad/Downloads/alarm -ArchType 0 -Depth 1'
info,2/13/2019 10:17:47 PM Source Language = 'src'
info,2/13/2019 10:17:47 PM Target Language = 'tgt'
info,2/13/2019 10:17:47 PM SSE Enable = 'True'
info,2/13/2019 10:17:47 PM SSE Size = '256'
info,2/13/2019 10:17:47 PM Processor counter = '8'
info,2/13/2019 10:17:47 PM Hidden Size = '50'
info,2/13/2019 10:17:47 PM Word Vector Size = '50'
info,2/13/2019 10:17:47 PM Learning Rate = '0.1'
info,2/13/2019 10:17:47 PM Network Layer = '1'
info,2/13/2019 10:17:47 PM Gradient Clip = '5'
info,2/13/2019 10:17:47 PM Dropout Ratio = '0.1'
info,2/13/2019 10:17:47 PM Batch Size = '1'
info,2/13/2019 10:17:47 PM Arch Type = 'GPU_CUDA'
info,2/13/2019 10:17:47 PM Device Ids = '0'
info,2/13/2019 10:17:47 PM Initialize device '0'
Precompiling GatherScatterKernels
Precompiling IndexSelectKernels
Precompiling ReduceDimIndexKernels
Precompiling CudaReduceAllKernels
Precompiling CudaReduceKernels
Precompiling ElementwiseKernels
Precompiling FillCopyKernels
Precompiling SoftmaxKernels
Precompiling SpatialMaxPoolKernels
Precompiling VarStdKernels
info,2/13/2019 10:17:48 PM Building vocabulary from training corpus...
info,2/13/2019 10:17:48 PM Shuffling training corpus...
info,2/13/2019 10:17:48 PM Shuffle training corpus...
info,2/13/2019 10:17:48 PM Shuffled '142' sentence pairs.
info,2/13/2019 10:17:48 PM Found 0 sentences are longer than '32' tokens, ignore them.
info,2/13/2019 10:17:48 PM Source language Max term id = '161'
info,2/13/2019 10:17:48 PM Target language Max term id = '61'
info,2/13/2019 10:17:48 PM Initializing weights...
info,2/13/2019 10:17:48 PM Initializing weights for device '0'
info,2/13/2019 10:17:48 PM Initializing encoders and decoders for device '0'...
info,2/13/2019 10:17:48 PM Start to train...
info,2/13/2019 10:17:48 PM Shuffling training corpus...
info,2/13/2019 10:17:48 PM Base learning rate is '0.1' at epoch '0'
info,2/13/2019 10:17:48 PM Cleaning cache of weights optmiazation.'
info,2/13/2019 10:17:48 PM Start to process training corpus.
info,2/13/2019 10:17:48 PM Shuffling training corpus...
usptact commented 5 years ago

My corpus has only 143 training examples. Could this be an issue?

usptact commented 5 years ago

In the meantime, one CPU core is at 100% and RAM usage has climbed past 10 GB. And it is still climbing.

usptact commented 5 years ago

It works! I just had to wait a while longer. The iterations are coming in at almost one per second!

Before I started today, I also disabled Kaspersky Antivirus. Maybe it was slowing down the CUDA compilation and other things.

zhongkaifu commented 5 years ago

I have run Seq2SeqSharp on a corpus of hundreds of millions of sentence pairs and it works very well. :)

Precompiling is a CPU-only task; that is why you saw 100% CPU usage. It only happens during the first run. On subsequent runs, Seq2SeqSharp uses the cached compiled kernels and runs much more quickly.

In addition, I don't think "Kaspersky Antivirus" is related to this.