Thanks, @Tronic!
This tutorial is still very much a work in progress, so I doubly appreciate someone giving valuable input already at this stage.
For the C performance comparison, it is not quite fair
It's not supposed to be fair. In the end, NumPy is written in C, so it could never "beat C". The message I'm trying to get across is that the creators of NumPy have put a lot of thought into which algorithms to use. This includes the random number generator. So it will beat some naive C code written by someone who did not think about these things. I was not aware that rand() is known for being slow, so thanks for pointing that out. However, I feel that the C++ code you suggest is no longer naive, as in, it is not the first StackOverflow hit on "how to generate a random number in C". You are being explicit about which algorithm to use, etc. Plus, the code is probably too difficult to read for someone who only knows Python. Nevertheless, I'm putting a link to this issue in the tutorial, so this becomes a sort of footnote.
Thanks for pointing out that they overhauled the random number generation in NumPy. I was blissfully unaware of that! Of course, the NumPy code in this tutorial should be modern and top-notch, so I now use the new API. Check current master for updates.
Looking forward to more of your comments!
matmul.c
#include <stdio.h>
#include <stdlib.h>
int main() {
    unsigned N = 1000;
    // Initialise matrices a and b of N*N ones
    double* a = (double*)malloc(N * N * sizeof(double));
    double* b = (double*)malloc(N * N * sizeof(double));
    for (unsigned n = 0; n < N * N; ++n) a[n] = b[n] = 1.0;
    // Matrix product r = a @ b
    double* r = (double*)malloc(N * N * sizeof(double));
    for (unsigned i = 0; i < N; ++i) {
        for (unsigned j = 0; j < N; ++j) {
            double s = 0.0;
            for (unsigned k = 0; k < N; ++k) {
                s += a[i * N + k] * b[k * N + j];
            }
            r[i * N + j] = s;
        }
    }
    // Verify results
    for (unsigned n = 0; n < N * N; ++n) {
        if (r[n] != N) {
            printf("Incorrect result %f\n", r[n]);
            break;
        }
    }
    free(r); free(b); free(a);
}
matmul.py
import numpy as np
N = 1000
a, b = np.ones((N, N)), np.ones((N, N))
r = a @ b
assert np.allclose(r, N)
$ clang matmul.c -o matmul -O3
$ time ./matmul
1.72 real 1.68 user 0.02 sys
$ time python3 matmul.py
0.21 real 0.51 user 0.11 sys
Quite a brutal difference, even taking into account that Numpy uses all four CPU cores while the naive C implementation can only use one.
Indeed. Intel's MKL implementation of BLAS is insanely optimized. I believe they even have different multiplication algorithms for small vs big matrices.
This is something I've been looking for, for a good while now. People have a hard time comprehending that Python+Numpy is in fact often faster than C, and refuse to look any closer. Also, most Numpy tutorials are badly outdated (e.g. still using np.dot instead of @).
A few notes about random numbers, though.
For the C performance comparison, it is not quite fair to use rand(), which uses a notoriously bad algorithm (for a couple of legacy reasons). The C++ standard library would be the appropriate comparison, as it allows for high-performance PRNGs such as mt19937_64 and pcg64 (the latter of which is not in stdlib yet, though). I believe that there are plenty of other things where Numpy is still faster than naive C, even if the C code is not calling bad stdlib functions.
Also, do note that Numpy just revamped its entire random number module, and what you are using is the old API. The new API is similar to that of the C++ stdlib, i.e. instead of a global random state (seeded by np.random.seed) you construct generator objects that hold their own state.
https://docs.scipy.org/doc/numpy/reference/random/index.html#introduction
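For instance, the generator-based API can be used roughly like this (a minimal sketch for illustration; the tutorial's actual code may differ):

import numpy as np

# Old API: one global state, seeded via np.random.seed
np.random.seed(42)
old = np.random.normal(size=1_000_000)

# New API: a Generator object that holds its own state
rng = np.random.default_rng(42)
new = rng.normal(size=1_000_000)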
An equivalent in C++:
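(A minimal sketch of what such code could look like, assuming std::mt19937_64 driving std::normal_distribution to produce a million values; not necessarily the exact code that was posted.)

#include <cstdio>
#include <random>
#include <vector>

int main() {
    // A generator object holding its own state, much like the new Numpy API
    std::mt19937_64 rng{12345};
    std::normal_distribution<double> normal{0.0, 1.0};
    // One million normal-distributed random numbers
    std::vector<double> values(1000000);
    for (double& v : values) v = normal(rng);
    std::printf("first: %f\n", values[0]);
}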
This takes 24 ms to create a million normal-distributed random numbers (because Mersenne Twister is not quite as good as PCG).