usptact closed this issue 5 years ago.
The first run usually takes longer because of initialization; it then creates a cuda_cache to speed up subsequent runs. Did you already try it with ManagedCUDA 8.0?
In my environment, I installed the driver and libraries for CUDA 10.0 but still use ManagedCUDA 8.0. It works fine for me.
Yes, I tried with the original ManagedCUDA 8.0 (as provided in the repo) as well. Same behavior. I tried on two different Windows 10 machines. I had to manually copy a bunch of DLLs into the runtime directory. Is that what you did as well?
It looks like the environment variables (mentioned in the previous, now-closed issue) are not actually respected by CUDA or the application.
If somebody checks out your repo, what steps does one need to take to run training? In my case, I clone the repo, build Seq2SeqConsole.exe in VS2017, and then try to run it. I get a missing-DLLs error. I wonder whether I am setting things up incorrectly.
Yes, you need to copy the DLLs below into your current directory before running it. They are distributed in the Nvidia CUDA package, so Seq2SeqSharp doesn't include them. "*" stands for the CUDA version:
cublas64_*.dll nvrtc64_*.dll nvrtc-builtins64_*.dll
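The copy step can be sketched as a small shell loop. This is illustrative only: the temporary directories stand in for the real CUDA bin directory (e.g. the CUDA v10.0 toolkit's bin folder) and for the folder containing Seq2SeqConsole.exe, and the dummy file names assume the CUDA 10.0 naming scheme.

```shell
# Temp dirs stand in for the real CUDA bin and application folders.
CUDA_BIN=$(mktemp -d)   # stands in for the CUDA toolkit bin directory
APP_DIR=$(mktemp -d)    # stands in for the folder holding Seq2SeqConsole.exe

# Dummy files using the CUDA 10.0 naming scheme (assumed for illustration).
touch "$CUDA_BIN/cublas64_100.dll" \
      "$CUDA_BIN/nvrtc64_100_0.dll" \
      "$CUDA_BIN/nvrtc-builtins64_100.dll"

# The actual copy step: one wildcard per required library.
for pattern in 'cublas64_*.dll' 'nvrtc64_*.dll' 'nvrtc-builtins64_*.dll'; do
  cp "$CUDA_BIN"/$pattern "$APP_DIR"/
done

ls "$APP_DIR"
```

On a real machine you would point CUDA_BIN at the installed toolkit's bin directory and APP_DIR at the build output folder instead of temporary directories.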
Do you still get stuck during training even when you use ManagedCUDA 8.0?
Thank you for confirming. I copied the DLLs you mentioned into the runtime directory (where the .exe being run is).
Yes, unfortunately the issue remains. I tried with ManagedCUDA 8.0 first (a few days ago). Today, I tried on a different machine using ManagedCUDA 10.0 (checked out and built the respective GitHub repo).
A single CPU core is loaded at 100% the whole time. RAM usage fluctuates slowly but stays around 12-18 GB. The computation seems to be stuck at epoch 0. No GPU load shows up (I am using the GPU-Z utility to monitor it). It is not clear what the CPU is doing all that time.
What's your GPU type and model? Can you run "nvidia-smi" and post the output here?
I am running an NVIDIA Quadro M2000M card in this laptop. I tried ManagedCUDA 8.0 on this device. I am not sure how to get "nvidia-smi" output in Windows 10.
The desktop computer (where I ran ManagedCUDA 10.0 today) has an older NVIDIA GTX 1070. It is running Windows 10 as well.
"nvidia-smi" is usually located at "C:\Program Files\NVIDIA Corporation\NVSMI"
My dev environment is "NVIDIA GTX 1070 + Windows 10" as well and it works. Could you please 1) make a clean build (using ManagedCUDA 8.0 as the dependency), 2) copy the above DLLs to the current directory, and 3) retry it?
@zhongkaifu I really appreciate your patience in guiding me through the troubleshooting! I will get back to you once I am back at my computer later today.
I just made some new changes, such as upgrading the default ManagedCUDA dependency to version 10.0. I've tested it on several GPU types, such as GTX 1070/P40/K40m. Could you please check out the latest source code, build, and retry it?
Thanks for the updates to the code! I will check it out! I am currently working on a Linux PC, so no VS2017 access for now...
Here's the output of "nvidia-smi.exe":
$ ./nvidia-smi.exe
Wed Feb 13 21:58:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 417.35 Driver Version: 417.35 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 1070 WDDM | 00000000:01:00.0 On | N/A |
| 0% 41C P8 11W / 151W | 768MiB / 8192MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 1844 C+G ...o\2017\Community\Common7\IDE\devenv.exe N/A |
| 0 3416 C+G Insufficient Permissions N/A |
| 0 7140 C+G ...anizer\Elements Auto Creations 2019.exe N/A |
| 0 8924 C+G ...ost.CLR.x86\ServiceHub.Host.CLR.x86.exe N/A |
| 0 9740 C+G ...8.138.0_x64__kzf8qxf38zg5c\SkypeApp.exe N/A |
| 0 10056 C+G ...t_cw5n1h2txyewy\ShellExperienceHost.exe N/A |
| 0 14084 C+G Insufficient Permissions N/A |
| 0 14420 C+G C:\Windows\explorer.exe N/A |
| 0 16880 C+G ...6)\Google\Chrome\Application\chrome.exe N/A |
| 0 16996 C+G ...dows.Cortana_cw5n1h2txyewy\SearchUI.exe N/A |
+-----------------------------------------------------------------------------+
I am running the same command on the same data:
Seq2SeqConsole.exe -TaskName train -WordVectorSize 50 -HiddenSize 50 -LearningRate 0.1 -ModelFilePath alarm.model -SrcVocab data_vocab.source -TgtVocab data_vocab.target -SrcLang en -TgtLang lf -TrainCorpusPath ~/Downloads -ArchType 0 -Depth 1
The current computation is stuck at:
info,2/13/2019 10:00:42 PM Command Line = '-TaskName train -WordVectorSize 50 -HiddenSize 50 -LearningRate 0.1 -ModelFilePath alarm.model -SrcVocab data_vocab.source -TgtVocab data_vocab.target -SrcLang en -TgtLang lf -TrainCorpusPath C:/Users/vlad/Downloads -ArchType 0 -Depth 1'
info,2/13/2019 10:00:42 PM Source Language = 'en'
info,2/13/2019 10:00:42 PM Target Language = 'lf'
info,2/13/2019 10:00:42 PM SSE Enable = 'True'
info,2/13/2019 10:00:42 PM SSE Size = '256'
info,2/13/2019 10:00:42 PM Processor counter = '8'
info,2/13/2019 10:00:42 PM Hidden Size = '50'
info,2/13/2019 10:00:42 PM Word Vector Size = '50'
info,2/13/2019 10:00:42 PM Learning Rate = '0.1'
info,2/13/2019 10:00:42 PM Network Layer = '1'
info,2/13/2019 10:00:42 PM Gradient Clip = '5'
info,2/13/2019 10:00:42 PM Dropout Ratio = '0.1'
info,2/13/2019 10:00:42 PM Batch Size = '1'
info,2/13/2019 10:00:42 PM Arch Type = 'GPU_CUDA'
info,2/13/2019 10:00:42 PM Device Ids = '0'
info,2/13/2019 10:00:42 PM Initialize device '0'
Precompiling GatherScatterKernels
Precompiling IndexSelectKernels
Precompiling ReduceDimIndexKernels
Precompiling CudaReduceAllKernels
Precompiling CudaReduceKernels
Precompiling ElementwiseKernels
On the first run, "Precompiling ElementwiseKernels" takes a long time to precompile the element-wise kernels, which are then kept in a cache folder. It takes about 15 minutes on my devbox. How about yours? How long did you wait?
On my box it took a couple of minutes. Right now I am waiting at the very same spot as before:
info,2/13/2019 10:17:47 PM Command Line = '-TaskName train -WordVectorSize 50 -HiddenSize 50 -LearningRate 0.1 -ModelFilePath alarm.model -SrcLang src -TgtLang tgt -TrainCorpusPath C:/Users/vlad/Downloads/alarm -ArchType 0 -Depth 1'
info,2/13/2019 10:17:47 PM Source Language = 'src'
info,2/13/2019 10:17:47 PM Target Language = 'tgt'
info,2/13/2019 10:17:47 PM SSE Enable = 'True'
info,2/13/2019 10:17:47 PM SSE Size = '256'
info,2/13/2019 10:17:47 PM Processor counter = '8'
info,2/13/2019 10:17:47 PM Hidden Size = '50'
info,2/13/2019 10:17:47 PM Word Vector Size = '50'
info,2/13/2019 10:17:47 PM Learning Rate = '0.1'
info,2/13/2019 10:17:47 PM Network Layer = '1'
info,2/13/2019 10:17:47 PM Gradient Clip = '5'
info,2/13/2019 10:17:47 PM Dropout Ratio = '0.1'
info,2/13/2019 10:17:47 PM Batch Size = '1'
info,2/13/2019 10:17:47 PM Arch Type = 'GPU_CUDA'
info,2/13/2019 10:17:47 PM Device Ids = '0'
info,2/13/2019 10:17:47 PM Initialize device '0'
Precompiling GatherScatterKernels
Precompiling IndexSelectKernels
Precompiling ReduceDimIndexKernels
Precompiling CudaReduceAllKernels
Precompiling CudaReduceKernels
Precompiling ElementwiseKernels
Precompiling FillCopyKernels
Precompiling SoftmaxKernels
Precompiling SpatialMaxPoolKernels
Precompiling VarStdKernels
info,2/13/2019 10:17:48 PM Building vocabulary from training corpus...
info,2/13/2019 10:17:48 PM Shuffling training corpus...
info,2/13/2019 10:17:48 PM Shuffle training corpus...
info,2/13/2019 10:17:48 PM Shuffled '142' sentence pairs.
info,2/13/2019 10:17:48 PM Found 0 sentences are longer than '32' tokens, ignore them.
info,2/13/2019 10:17:48 PM Source language Max term id = '161'
info,2/13/2019 10:17:48 PM Target language Max term id = '61'
info,2/13/2019 10:17:48 PM Initializing weights...
info,2/13/2019 10:17:48 PM Initializing weights for device '0'
info,2/13/2019 10:17:48 PM Initializing encoders and decoders for device '0'...
info,2/13/2019 10:17:48 PM Start to train...
info,2/13/2019 10:17:48 PM Shuffling training corpus...
info,2/13/2019 10:17:48 PM Base learning rate is '0.1' at epoch '0'
info,2/13/2019 10:17:48 PM Cleaning cache of weights optmiazation.'
info,2/13/2019 10:17:48 PM Start to process training corpus.
info,2/13/2019 10:17:48 PM Shuffling training corpus...
My corpus has only 143 training examples. Can this be an issue?
Meanwhile, one CPU core is at 100% and RAM usage has climbed past 10 GB, and it is still climbing.
It works! I just had to wait a while longer! The iterations are now coming in almost once per second!
Before starting today, I also disabled Kaspersky Antivirus. Maybe it was slowing down the CUDA compilation and other steps.
I ran Seq2SeqSharp on a corpus of a hundred million examples and it works very well. :)
Precompiling is a CPU-only task, which is why you saw 100% CPU usage. It only happens during the first run; on subsequent runs, Seq2SeqSharp uses the cached compiled kernels and runs much more quickly.
In addition, I don't think Kaspersky Antivirus is related to this.
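The compile-once, cache-forever behavior described above can be illustrated with a small shell sketch. This is not Seq2SeqSharp's actual code; it just mimics the idea of keying cached compilation artifacts by a hash of the kernel source, so the slow "compile" only happens the first time a given kernel is seen.

```shell
# Illustrative kernel-compilation cache keyed by a source hash.
# (A sketch of the caching idea, not Seq2SeqSharp's real implementation.)
CACHE_DIR=$(mktemp -d)/cuda_cache
mkdir -p "$CACHE_DIR"

compile_kernel() {
  src="$1"
  # Key the cache entry on a hash of the kernel source text.
  key=$(printf '%s' "$src" | sha256sum | cut -d' ' -f1)
  cached="$CACHE_DIR/$key.bin"
  if [ -f "$cached" ]; then
    echo "cache hit: $cached"
  else
    # Stand-in for the slow NVRTC compilation on the first run.
    printf 'compiled(%s)' "$src" > "$cached"
    echo "compiled and cached: $cached"
  fi
}

compile_kernel "elementwise_add"   # first run: slow path, writes the cache
compile_kernel "elementwise_add"   # second run: fast path, cache hit
```

This is why the first run pegs one CPU core for minutes while later runs start almost immediately: the expensive step is skipped whenever a matching cache entry already exists.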
I installed CUDA 10.0 and built the ManagedCUDA (x64, Release) DLL libraries.
Before starting training, I copied several files into the same directory as "Seq2SeqConsole.exe".
I start training with this command:
./Seq2SeqConsole.exe -TaskName train -WordVectorSize 50 -HiddenSize 50 -LearningRate 0.1 -ModelFilePath alarm.model -SrcVocab data_vocab.source -TgtVocab data_vocab.target -SrcLang en -TgtLang lf -TrainCorpusPath ~/Downloads -ArchType 0 -Depth 1
The training starts and prints the following:
Then it gets stuck (the precompiling steps also took a while). One CPU core is loaded at 100% and 29 GB (!) of RAM is used! My system has 64 GB of RAM, so RAM does not appear to be the issue.
Do you have an idea what is going on? Maybe I missed an important step?
Note: Training using CPU only works just fine.
Thank you