wichtounet / etl_vs_blaze

MIT License

Benchmark instructions? #1

Open byzhang opened 9 years ago

byzhang commented 9 years ago

Could you add simple instructions on how to build and run the benchmark? I don't have distcc, and the makefile raised some errors when I ran CXX=clang++ make release

wichtounet commented 9 years ago

Hello,

I hadn't thought people would actually run this :D

First, have you made a recursive clone? You need to add --recursive to the clone command, otherwise it won't work.
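For example:

git clone --recursive https://github.com/wichtounet/etl_vs_blaze.git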

Then to build and run, you can:

LD=clang++ CXX=clang++ make run

If you have any more problems, please post the full error message and I'll get back to you.

Baptiste

byzhang commented 9 years ago

I was using Blaze, but I'm interested in learning about alternatives designed for deep learning. I added -lopenblas to the Makefile and now it works:

| Name | Blaze | ETL |
| static_add:8192 | 0us | 0us |
| dynamic_add:32768 | 2.681ms | 2.409ms |
| dynamic_add:65536 | 4.618ms | 4.701ms |
| dynamic_add:131072 | 9.011ms | 9.228ms |
| dynamic_add_complex:32768 | 2.885ms | 3.1ms |
| dynamic_add_complex:65536 | 5.844ms | 6.521ms |
| dynamic_add_complex:131072 | 11.276ms | 12.651ms |
| dynamic_mix:32768 | 2.741ms | 6.178ms |
| dynamic_mix:65536 | 5.564ms | 12.562ms |
| dynamic_mix:131072 | 10.721ms | 25.045ms |
| dynamic_mix_matrix:256x256 | 5.718ms | 12.507ms |
| dynamic_mix_matrix:512x512 | 38.374ms | 49.996ms |
| dynamic_mix_matrix:578x769 | 65.184ms | 82.886ms |
| dynamic_mmul:128x32x64 | 2.102ms | 5.873ms |
| dynamic_mmul:128x128x128 | 4.381ms | 47.282ms |
| dynamic_mmul:256x128x256 | 10.409ms | 224.717ms |
| dynamic_mmul:256x256x256 | 15.617ms | 454.195ms |
| dynamic_mmul:300x200x400 | 27.027ms | 630.774ms |
| dynamic_mmul:512x512x512 | 173.133ms | 3.848796s |

I'm running it on an i7-5960X (i.e. AVX2); do the above results look right to you? P.S. I noticed the -march=native flag.

wichtounet commented 9 years ago

Yes, that seems about right. I also have an AVX2 processor on which I do my tests.

ETL was designed specifically for deep learning and I use it inside my (C)RBM and (C)DBN implementations, but it is not as well optimized as Blaze. If you were to try it, I'd be very eager to help you and improve the library to fit your needs.

The very large difference in mmul is due to the fact that I do not use BLAS operations for matrix multiplication yet, but I intend to, and I don't expect it to take long.
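For the record, the plan is essentially to forward the multiplication to cblas_dgemm, along these lines (just a sketch with a made-up helper name, not ETL's actual interface):

```cpp
#include <cblas.h>

// Hypothetical helper: C (m x n) = A (m x k) * B (k x n), row-major doubles.
void mmul_blas(const double* a, const double* b, double* c, int m, int k, int n) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, a, k,  // A, leading dimension k
                b, n,       // B, leading dimension n
                0.0, c, n); // C, leading dimension n
}
```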

The difference in dynamic_mix can be due to several things. Either Blaze is reordering operations to reduce the amount of work, or it has very well-tuned vectorized operations for this case. I'll probably have to inspect the generated assembly to see what makes the big difference.

byzhang commented 9 years ago

Appreciate it! I saw you already have commits for BLAS :) BTW, is it easy to extend dll to RNNs? I noticed the learner seems specialized for RBMs.

Thanks, -B

wichtounet commented 9 years ago

Yes, I tried with BLAS and it is a bit faster; however, Blaze is slower in BLAS mode :s Perhaps they have highly optimized routines for small matrices.

You could probably reuse some parts of the dll library to implement an RNN, but most of it is indeed specialized for RBM/DBN. You'd have to implement a dedicated trainer as well as the RNN data structure itself.

wichtounet commented 9 years ago

With the latest changes, without BLAS, using G++, there is much less difference between ETL and Blaze for mmul :)

byzhang commented 9 years ago

I don't have G++ 4.9. clang++ shows it's a little bit slower, and it core dumped for dynamic_mmul:512x512x512.

LD=clang++ CXX=clang++ make run
./release/bin/bench

| Name | Blaze | ETL |
| static_add:8192 | 1us | 15us |
| dynamic_add:32768 | 2.448ms | 2.408ms |
| dynamic_add:65536 | 4.6ms | 4.707ms |
| dynamic_add:131072 | 8.972ms | 9.222ms |
| dynamic_add_complex:32768 | 2.9ms | 3.201ms |
| dynamic_add_complex:65536 | 5.816ms | 6.407ms |
| dynamic_add_complex:131072 | 11.31ms | 12.865ms |
| dynamic_mix:32768 | 2.799ms | 6.311ms |
| dynamic_mix:65536 | 5.556ms | 12.44ms |
| dynamic_mix:131072 | 10.745ms | 24.971ms |
| dynamic_mix_matrix:256x256 | 5.709ms | 12.446ms |
| dynamic_mix_matrix:512x512 | 38.512ms | 50.217ms |
| dynamic_mix_matrix:578x769 | 65.754ms | 82.998ms |
| dynamic_mmul:128x32x64 | 2.102ms | 10.9ms |
| dynamic_mmul:128x128x128 | 4.68ms | 83.452ms |
| dynamic_mmul:256x128x256 | 10.314ms | 336.852ms |
| dynamic_mmul:256x256x256 | 27.172ms | 656.676ms |
| dynamic_mmul:300x200x400 | 25.5ms | 936.407ms |

make: *** [run] Segmentation fault (core dumped)

The core dump shows:

(gdb) where
#0  0x00002aadc8046d14 in dgemm_kernel () from /usr/local/lib/libopenblas.so.0
#1  0x0000000000000000 in ?? ()

make debug doesn't help.

wichtounet commented 9 years ago

I haven't tried with openblas, only with cblas; maybe there is a difference. I'll have to check that.

I'll publish the version with the optimized C++ dgemm tonight, it's better :)
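The idea is essentially a cache-blocked kernel along these lines (a generic sketch of the technique, not the exact code that will be published):

```cpp
#include <algorithm>

// Generic blocked dgemm sketch: C += A * B, row-major, A is m x k, B is k x n.
// Blocking the loops improves cache reuse compared to the naive triple loop.
void dgemm_blocked(const double* a, const double* b, double* c, int m, int k, int n) {
    const int BS = 64; // block size, to be tuned for the cache hierarchy
    for (int ii = 0; ii < m; ii += BS)
        for (int pp = 0; pp < k; pp += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < std::min(ii + BS, m); ++i)
                    for (int p = pp; p < std::min(pp + BS, k); ++p) {
                        const double a_ip = a[i * k + p];
                        for (int j = jj; j < std::min(jj + BS, n); ++j)
                            c[i * n + j] += a_ip * b[p * n + j];
                    }
}
```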

wichtounet commented 9 years ago

Here are the latest results I get:

a) clang++

| Name | Blaze | ETL |
| static_add:8192 | 0us | 0us |
| dynamic_add:32768 | 2.187ms | 2.204ms |
| dynamic_add:65536 | 5.169ms | 4.661ms |
| dynamic_add:131072 | 9.317ms | 8.706ms |
| dynamic_add_complex:32768 | 2.604ms | 2.66ms |
| dynamic_add_complex:65536 | 6.324ms | 6.651ms |
| dynamic_add_complex:131072 | 11.779ms | 12.773ms |
| dynamic_mix:32768 | 2.502ms | 6.717ms |
| dynamic_mix:65536 | 6.008ms | 14.795ms |
| dynamic_mix:131072 | 11.015ms | 28.137ms |
| dynamic_mix_matrix:256x256 | 6.149ms | 14.599ms |
| dynamic_mix_matrix:512x512 | 45.358ms | 56.401ms |
| dynamic_mix_matrix:578x769 | 79.621ms | 96.037ms |
| dynamic_mmul:128x32x64 | 2.102ms | 11.927ms |
| dynamic_mmul:128x128x128 | 24.693ms | 91.466ms |
| dynamic_mmul:256x128x256 | 162.359ms | 368.031ms |
| dynamic_mmul:256x256x256 | 394.272ms | 725.383ms |
| dynamic_mmul:300x200x400 | 320.368ms | 1.028263s |
| dynamic_mmul:512x512x512 | 3.239539s | 5.796826s |

b) g++

| Name | Blaze | ETL |
| static_add:8192 | 31us | 0us |
| dynamic_add:32768 | 2.199ms | 2.208ms |
| dynamic_add:65536 | 5.22ms | 5.325ms |
| dynamic_add:131072 | 9.559ms | 9.593ms |
| dynamic_add_complex:32768 | 2.622ms | 2.4ms |
| dynamic_add_complex:65536 | 6.413ms | 6.101ms |
| dynamic_add_complex:131072 | 11.744ms | 11.246ms |
| dynamic_mix:32768 | 2.301ms | 6.717ms |
| dynamic_mix:65536 | 5.511ms | 14.517ms |
| dynamic_mix:131072 | 11.768ms | 28.764ms |
| dynamic_mix_matrix:256x256 | 6.401ms | 15.007ms |
| dynamic_mix_matrix:512x512 | 43.072ms | 56.027ms |
| dynamic_mix_matrix:578x769 | 78.294ms | 95.94ms |
| dynamic_mmul:128x32x64 | 2.172ms | 6.809ms |
| dynamic_mmul:128x128x128 | 21.25ms | 48.456ms |
| dynamic_mmul:256x128x256 | 142.54ms | 193.952ms |
| dynamic_mmul:256x256x256 | 332.122ms | 370.771ms |
| dynamic_mmul:300x200x400 | 293.536ms | 528.94ms |
| dynamic_mmul:512x512x512 | 2.752206s | 3.007125s |

I'm quite satisfied with these latest results :) Now I'll have to work on dynamic_mix, where I'm twice as slow as Blaze for some reason I don't yet understand :D I also have an idea to make dynamic_mmul even faster, but I don't know when I'll have the time.

You have insane Blaze results on your computer though... Do you have BLAS enabled in Blaze? What version of Clang do you have? Do you have a custom Blaze config?

Thanks for your tests :)

byzhang commented 9 years ago

Yes, I enabled openblas for Blaze (with OpenMP enabled in openblas), running on an Intel® Core™ i7-5960X Processor Extreme Edition (20M cache, up to 3.50 GHz).

clang++ --version
Ubuntu clang version 3.4.2-3ubuntu2~xedgers (tags/RELEASE_34/dot2-final) (based on LLVM 3.4.2)
Target: x86_64-pc-linux-gnu
Thread model: posix

The customized Blaze config (all other options are empty):

VERSION="release"
CXX="g++"
CXXFLAGS="-Werror -Wall -Wextra -Wshadow -Woverloaded-virtual -ansi -O3 -DNDEBUG -fopenmp -mavx2 -Wno-unused-local-typedefs"
LIBRARY="static"
BLAS="yes"
BLAS_INCLUDE_PATH=/usr/local/include
BLAS_INCLUDE_FILE="cblas.h"
BLAS_IS_PARALLEL="yes"
MPI="no"

Thanks, -B

wichtounet commented 9 years ago

Could you try disabling BLAS_IS_PARALLEL="yes"? It is not really fair to compare single-threaded to multi-threaded :D

byzhang commented 9 years ago

I restricted it to 1 thread:

OMP_NUM_THREADS=1 release/bin/bench

| Name | Blaze | ETL |
| static_add:8192 | 0us | 0us |
| dynamic_add:32768 | 2.342ms | 2.306ms |
| dynamic_add:65536 | 4.599ms | 4.717ms |
| dynamic_add:131072 | 9.038ms | 9.452ms |
| dynamic_add_complex:32768 | 2.7ms | 3.007ms |
| dynamic_add_complex:65536 | 5.785ms | 6.336ms |
| dynamic_add_complex:131072 | 11.252ms | 12.478ms |
| dynamic_mix:32768 | 2.511ms | 6.203ms |
| dynamic_mix:65536 | 5.476ms | 12.625ms |
| dynamic_mix:131072 | 10.698ms | 24.834ms |
| dynamic_mix_matrix:256x256 | 5.713ms | 12.442ms |
| dynamic_mix_matrix:512x512 | 39.824ms | 108.016ms |
| dynamic_mix_matrix:578x769 | 65.275ms | 82.998ms |
| dynamic_mmul:128x32x64 | 2.2ms | 11.521ms |
| dynamic_mmul:128x128x128 | 11.021ms | 84.158ms |
| dynamic_mmul:256x128x256 | 38.429ms | 337.558ms |
| dynamic_mmul:256x256x256 | 73.205ms | 658.226ms |
| dynamic_mmul:300x200x400 | 98.214ms | 937.596ms |
| dynamic_mmul:512x512x512 | 544.625ms | 5.197826s |

With a single thread, the mmul is about 3x slower. But openblas is still much faster than cblas.

Thanks, -B

wichtounet commented 9 years ago

Now your Blaze results are closer to mine.

Your ETL results are much worse than mine... Do you still have ETL_BLAS_MODE enabled in the makefile?

I'll have to try openblas.
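For reference, ETL_BLAS_MODE is just a compile-time define; enabling it in the benchmark makefile amounts to something like the following (the exact variable names depend on the makefile, and a BLAS library has to be linked):

CXXFLAGS += -DETL_BLAS_MODE
LD_FLAGS += -lcblas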

byzhang commented 9 years ago

I didn't know about ETL_BLAS_MODE. With it, the difference is much smaller and is now mostly on dynamic_mix:

| Name | Blaze | ETL |
| static_add:8192 | 0us | 0us |
| dynamic_add:32768 | 2.511ms | 2.4ms |
| dynamic_add:65536 | 4.513ms | 4.7ms |
| dynamic_add:131072 | 8.9ms | 9.196ms |
| dynamic_add_complex:32768 | 2.811ms | 3.204ms |
| dynamic_add_complex:65536 | 5.788ms | 6.378ms |
| dynamic_add_complex:131072 | 11.24ms | 12.735ms |
| dynamic_mix:32768 | 2.704ms | 6.2ms |
| dynamic_mix:65536 | 5.514ms | 12.563ms |
| dynamic_mix:131072 | 10.686ms | 25.155ms |
| dynamic_mix_matrix:256x256 | 5.693ms | 12.443ms |
| dynamic_mix_matrix:512x512 | 38.534ms | 50.461ms |
| dynamic_mix_matrix:578x769 | 65.964ms | 82.933ms |
| dynamic_mmul:128x32x64 | 2.1ms | 2.305ms |
| dynamic_mmul:128x128x128 | 4.513ms | 4.081ms |
| dynamic_mmul:256x128x256 | 8.925ms | 9.363ms |
| dynamic_mmul:256x256x256 | 15.508ms | 15.599ms |
| dynamic_mmul:300x200x400 | 25.49ms | 25.327ms |
| dynamic_mmul:512x512x512 | 167.548ms | 166.959ms |

wichtounet commented 9 years ago

The mode is quite new and not documented at all... I just added it (hardly correctly) to test it on mmul.

You mean on dynamic_mmul? I only use BLAS for mmul, and only under some strict conditions; I'll have to improve that.

That is great :) openblas is really impressive!

byzhang commented 9 years ago

No. I mean that the difference is now mostly on:

| dynamic_mix:32768 | 2.704ms | 6.2ms |
| dynamic_mix:65536 | 5.514ms | 12.563ms |
| dynamic_mix:131072 | 10.686ms | 25.155ms |
| dynamic_mix_matrix:256x256 | 5.693ms | 12.443ms |
| dynamic_mix_matrix:512x512 | 38.534ms | 50.461ms |
| dynamic_mix_matrix:578x769 | 65.964ms | 82.933ms |

As you discovered in previous comments. BTW, feel free to close this issue if you prefer, as we are getting a little off topic, although the discussion is very interesting and helpful.

wichtounet commented 9 years ago

OK. Indeed, this is a weak point, and it is even more pronounced on my machine.

No need to close the issue :) I'd rather have discussions like this one than an issue-free project.

byzhang commented 9 years ago

I'm trying to add the https://github.com/tqchen/mshadow GPU tensor library into the bench. It needs nvcc, and I remember that nvcc can work with clang++ as the host compiler, but I forget the details and I'm not sure of the best way to extend your simple.cpp and your make system.

Thanks, -B

wichtounet commented 9 years ago

As it is, the make system is really not designed to handle separate CUDA compilation. It will perhaps be easier to create a second executable, or at least a second source file compiled differently. At work, I don't even have a GPU to run tests :(
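If you try it, the simplest is probably a dedicated CUDA source compiled on its own with nvcc, using clang++ as the host compiler via -ccbin, roughly like this (the file and output names are made up, and host-compiler support depends on your CUDA version):

nvcc -ccbin clang++ -O3 -o release/bin/bench_gpu bench_gpu.cu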