wmvanvliet / elegant_efficient_numpy

Elegant and efficient NumPy
BSD 2-Clause "Simplified" License

Interesting project, random notes #1

Closed Tronic closed 4 years ago

Tronic commented 4 years ago

This is something I've been looking for, for a good while now. People have a hard time comprehending that Python+Numpy is in fact often faster than C, and refuse to look any closer. Also, most Numpy tutorials are badly outdated (e.g. still using np.dot instead of @).
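As a quick illustration of that last point (an editorial sketch, not part of the original comment): `@` (i.e. `np.matmul`, available since Python 3.5 via PEP 465) agrees with `np.dot` for 2-D arrays, but the two differ on stacks of matrices:

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)
b = np.arange(6.0).reshape(3, 2)

# For 2-D arrays, @ (np.matmul) and np.dot give the same result
assert np.array_equal(a @ b, np.dot(a, b))

# For stacks of matrices they differ: @ broadcasts over the leading
# dimensions, while np.dot sums over the last axis of the first
# argument and the second-to-last axis of the second
s = np.ones((4, 2, 3))
t = np.ones((4, 3, 2))
print((s @ t).shape)       # (4, 2, 2) - batched matrix product
print(np.dot(s, t).shape)  # (4, 2, 4, 2) - generalized tensor dot
```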

A few notes about random numbers, though.

For the C performance comparison, it is not quite fair to use rand(), which uses a notoriously bad algorithm (for a couple of legacy reasons). The C++ standard library would be the appropriate comparison, as it allows for high-performance PRNGs such as mt19937_64 and pcg64 (the latter of which is not in the stdlib yet, though). I believe there are plenty of other areas where Numpy is still faster than naive C, even when the C code is not calling bad stdlib functions.

Also, do note that Numpy just revamped its entire random number module, and what you are using is the old API. The new API is similar to that of C++ stdlib, i.e. instead of global random state (seeded by np.random.seed) you construct generator objects that hold their own state.

import numpy as np

# Construct a PCG64 PRNG with true-random seed
gen = np.random.default_rng()

%timeit gen.normal(size=1000000)                                       
14.8 ms ± 17.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

https://docs.scipy.org/doc/numpy/reference/random/index.html#introduction
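A small illustration of the generator-object design (an editorial sketch using the same `default_rng` API): passing a seed gives a reproducible, self-contained stream, and each generator carries its own state instead of sharing a global one:

```python
import numpy as np

# Two generators constructed with the same seed produce identical streams
g1 = np.random.default_rng(seed=42)
g2 = np.random.default_rng(seed=42)
assert np.array_equal(g1.normal(size=5), g2.normal(size=5))

# Each generator holds its own state, so independent generators do not
# interfere with each other or with the legacy np.random global state
g3 = np.random.default_rng()  # seeded with fresh entropy from the OS
print(g3.normal(size=3))
```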

An equivalent in C++:

#include <array>
#include <random>
#include <vector>

// A helper function for constructing random-seeded random generators
template<typename Gen=std::mt19937_64> Gen random_generator() {
    std::random_device rd;  // Slow but true random numbers
    std::array<int, Gen::state_size> seed;
    for (int& val: seed) val = rd();
    std::seed_seq seed_seq(begin(seed), end(seed));
    return Gen(seed_seq);
}

int main() {
    auto gen = random_generator();
    std::vector<double> data(1000000);
    std::normal_distribution<double> normal;  // mean 0, stddev 1
    for (double& v: data) v = normal(gen);
}

This takes 24 ms to create a million normally distributed random numbers (because Mersenne Twister is not quite as fast as PCG).
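The same comparison can be made within NumPy itself, since `Generator` accepts interchangeable bit generators; a sketch (timings vary by machine):

```python
import numpy as np
from timeit import timeit

n = 1_000_000
pcg = np.random.Generator(np.random.PCG64())    # the default bit generator
mt = np.random.Generator(np.random.MT19937())   # Mersenne Twister

# Average over several runs; MT19937 is typically somewhat slower
t_pcg = timeit(lambda: pcg.normal(size=n), number=20) / 20
t_mt = timeit(lambda: mt.normal(size=n), number=20) / 20
print(f"PCG64:   {t_pcg * 1e3:.1f} ms per million normals")
print(f"MT19937: {t_mt * 1e3:.1f} ms per million normals")
```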

wmvanvliet commented 4 years ago

Thanks, @Tronic!

This tutorial is still very much work in progress, so I doubly appreciate someone giving valuable input already at this stage.

For the C performance comparison, it is not quite fair

It's not supposed to be fair. In the end, NumPy is written in C, so it could never "beat C". The message I'm trying to get across is that the creators of NumPy have put a lot of thought into which algorithms to use, including the random number generator. So it will beat some naive C code written by someone who did not think about these things. I was not aware that rand() is known for being a bad algorithm, so thanks for pointing that out.

However, I feel that the C++ code you suggest is no longer naive, as in, it is not the first StackOverflow hit on "how to generate a random number in C": you are explicit about which algorithm to use, and so on. Plus, the code is probably too difficult to read for someone who only knows Python. Nevertheless, I'm putting a link to this issue in the tutorial, so it becomes a sort of footnote.

Thanks for pointing out that they overhauled the random number generation in NumPy. I was blissfully unaware of that! Of course, the NumPy code in this tutorial should be modern and top-notch, so I now use the new API. Check current master for updates.

Looking forward to more of your comments!

Tronic commented 4 years ago

matmul.c

#include <stdio.h>
#include <stdlib.h>

int main() {
    unsigned N = 1000;
    // Initialise matrices a and b of N*N ones
    double* a = (double*)malloc(N * N * sizeof(double));
    double* b = (double*)malloc(N * N * sizeof(double));
    for (unsigned n = 0; n < N * N; ++n) a[n] = b[n] = 1.0;
    // Matrix product r = a @ b
    double* r = (double*)malloc(N * N * sizeof(double));
    for (unsigned i = 0; i < N; ++i) {
        for (unsigned j = 0; j < N; ++j) {
            double s = 0.0;
            for (unsigned k = 0; k < N; ++k) {
                s += a[i * N + k] * b[k * N + j];
            }
            r[i * N + j] = s;
        }
    }
    // Verify results
    for (unsigned n = 0; n < N * N; ++n) {
        if (r[n] != N) {
            printf("Incorrect result %f\n", r[n]);
            break;
        }
    }
    free(r); free(b); free(a);
}

matmul.py

import numpy as np

N = 1000
a, b = np.ones((N, N)), np.ones((N, N))
r = a @ b
assert np.allclose(r, N)

$ clang matmul.c -o matmul -O3
$ time ./matmul
        1.72 real         1.68 user         0.02 sys
$ time python3 matmul.py
        0.21 real         0.51 user         0.11 sys

Quite a brutal difference, even taking into account the fact that Numpy uses all four CPU cores while the naive C implementation only uses one.
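For a sense of scale, the same triple loop can be written in pure Python and timed against the `@` operator (an editorial sketch, run at a smaller N so the interpreted loop finishes quickly):

```python
import numpy as np
from timeit import timeit

N = 200  # smaller than the 1000 above so the pure-Python loop stays fast
a = [[1.0] * N for _ in range(N)]
b = [[1.0] * N for _ in range(N)]

def matmul_naive(a, b):
    # Same triple loop as the C version, in pure Python
    n = len(a)
    r = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += a[i][k] * b[k][j]
            r[i][j] = s
    return r

an, bn = np.ones((N, N)), np.ones((N, N))

t_py = timeit(lambda: matmul_naive(a, b), number=1)
t_np = timeit(lambda: an @ bn, number=1)
print(f"pure Python: {t_py * 1e3:.0f} ms, NumPy: {t_np * 1e3:.1f} ms")
```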

wmvanvliet commented 4 years ago

Indeed. Intel's MKL implementation of BLAS is insanely optimized. I believe they even have different multiplication algorithms for small vs big matrices.
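Which BLAS/LAPACK backend a given NumPy build links against can be checked directly (a minimal sketch; the exact output format depends on the NumPy version and build):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this NumPy build was linked against
# (e.g. MKL, OpenBLAS, or a reference implementation)
np.show_config()
```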