Closed 1yura closed 6 years ago
Thank you again for your contribution. I will test as soon as I have some spare time :)
The modulo operator is quite slow:
seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647);
can be substituted with this (should be ~30% faster):
const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);
I did not noticed then you replaced srandom_r () :)
Added the new code with (7738fda).
I simplified the code a bit, like removing the htonl()
functions and the % 2147483647
parts.
On my x86 machine it's about 60% faster, but my commit aimed for a quick and simple integration with the existing code. Some further testing and tweaking could be useful.
In the cycle for (int j = 0; j < 31; j++) last (when j == 30) calculation of seed is useless.
Yes. I think the code could be re-written this way:
static inline uint32_t glibc_fast_seed(uint32_t seed)
{
uint32_t word0 = 0;
for (int j = 3; j < 31 + 3 - 1; j++) {
word0 += seed * glibc_seed_tbl[j];
/* This does: seed = (16807LL * seed) % 0x7fffffff
using the sum of digits method which works for mod N, base N+1 */
const uint64_t p = 16807ULL * seed; // Seed is always positive (31 bits)
seed = (p >> 31) + (p & 0x7fffffff);
}
return (word0 + seed * glibc_seed_tbl[33]) >> 1;
}
The variable seed is always positive thus the multiplication 16807ULL * seed
is 63 bits or less and after the first shift (p >> 31)
the top bits of p
are always 0
. So the second part:
/* The result might still not fit in 31 bits, if not, repeat
(conditional seems to make it slightly faster) */
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;
is useless.
Maybe you can confirm me that.
BTW if you want come back to the IRC channel we can discuss these things there. I know that @rofl0r can be a bit annoying but bear with him ;)
Pushed the new changes (7acd739). On my laptop now the difference is much more noticeable compared to my old code: 4 times faster.
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m; is useless. Maybe you can confirm me that.
Not confirm. For example with seed 5951 glibc_fast_seed() gets wrong result. I don know why. With seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m; - all ok.
A constantly working messenger distracts me. But when there is something to discuss - I will join. And about - can ralink_random() produce any value or not - it can, i checked :)
I found something interesting:
seed = 0xfffffffe
function with code
seed = (uint32_t) (((uint64_t)16807 * seed) % 2147483647);
produces correct nonce. But function with
const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);
produces wrong nonce.
just tested this and the difference is noticeable , thank you @1yura for your contribution , just like the ralink optimization , the code now is way more faster , great work @wiire-a @1yura
Upd. Try this fast function: https://0x0.st/sNXM.c The output must be compared with the lower 7 bits of the first word(32 bit) nonce.
Will do soon.
At the time I added the conditional because it seemed to be faster that way. In this function however it seems to make it noticeably slower so I guess we should replace:
const uint64_t p = 16807ULL * seed;
const uint64_t m = (p >> 31) + (p & 0x7fffffff);
seed = (m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m;
with this:
uint64_t p = 16807ULL * seed;
p = (p >> 31) + (p & 0x7fffffff);
seed = (p >> 31) + (p & 0x7fffffff);
In the current code negative values are never tested since the seed corresponds always to the current time, it's always positive. So I wouldn't add the check with:
seed = (seed == 0x7fffffff || seed == 0xfffffffe) ? 0 : seed;
which only slows down the cracking process.
If the current time is not retrieved from the router, the value of 0 (Unix Epoch) is returned. But seed = 0 returns a value of 0. So internally the PRNG checks if seed = 0, if it is it changes it to 1, see glibc_random_old.c#L93:
/* We must make sure the seed is not 0. Take arbitrarily 1 in this case. */
if (seed == 0)
seed = 1;
The value of 0x7ffffffff
would be ok, but it represents 03:14:07 of January 19th 2038 (UTC)
. In practice it's never tested unless the user uses --end/start 02/2038
. But honestly I wouldn't worry about it.
What's the purpose of doing:
REALTEK_FAST_SEED_STEP
over and over again, if the seed doesn't change?
It updates the seed :)
# define REALTEK_FAST_SEED_STEP p = 16807ULL * seed; m = (p >> 31) + (p & 0x7fffffff); seed = (uint32_t)((m & 0xffffffff80000000) ? ((m >> 31) + (m & 0x7fffffff)) : m);
Old:
2242.215146 task-clock (msec) # 3.727 CPUs utilized
217 context-switches # 0.097 K/sec
3 cpu-migrations # 0.001 K/sec
140 page-faults # 0.062 K/sec
5,116,201,317 cycles # 2.282 GHz (83.45%)
64,305,319 stalled-cycles-frontend # 1.26% frontend cycles idle (83.54%)
2,728,813,181 stalled-cycles-backend # 53.34% backend cycles idle (33.12%)
8,297,312,960 instructions # 1.62 insn per cycle
# 0.33 stalled cycles per insn (50.20%)
556,638,317 branches # 248.254 M/sec (66.95%)
4,441,275 branch-misses # 0.80% of all branches (83.47%)
0.601541662 seconds time elapsed
New:
2267.696976 task-clock (msec) # 3.741 CPUs utilized
227 context-switches # 0.100 K/sec
2 cpu-migrations # 0.001 K/sec
141 page-faults # 0.062 K/sec
5,179,056,557 cycles # 2.284 GHz (83.62%)
17,023,000 stalled-cycles-frontend # 0.33% frontend cycles idle (83.18%)
3,291,806,179 stalled-cycles-backend # 63.56% backend cycles idle (33.36%)
6,936,919,663 instructions # 1.34 insn per cycle
# 0.47 stalled cycles per insn (51.10%)
80,430,879 branches # 35.468 M/sec (67.60%)
320,102 branch-misses # 0.40% of all branches (83.81%)
0.606194954 seconds time elapsed
It doesn't really change anything for me. Even if you unroll or remove the for loop, the instructions are still executed sequentially since one depends on the other.
For example:
REALTEK_FAST_SEED_STEP
b += realtek_fast_lookup_4d[(uint8_t)seed]; /* This must wait operation above to finish (since seed is shared) */
REALTEK_FAST_SEED_STEP /* This must wait operation above to finish (since seed is shared) */
b += realtek_fast_lookup_5[(uint8_t)seed]; /* This must wait operation above to finish (since seed is shared) */
If instead we'd have something like:
for (...) {
// ...
independent_operation_A();
independent_operation_B();
// ...
}
those 2 would execute in parallel if they wouldn't share the same variables (seed
in our case).
Also the most expensive operation is the modulo (%) or its variant in REALTEK_FAST_SEED_STEP
, a precomputed table maybe should aim at avoiding or reducing that.
I pushed the changes to fix my previous commit so we have the same starting point to judge optimizations and modifications.
Maybe another consideration that could be useful is the fact that we test the seeds in sequence:
seed
seed + 1
seed + 2
...
or, backwards
seed
seed - 1
seed - 2
...
Not sure if this helps, but if it would we could setup some intermediate table(s), eg: initstate_table(seed)
.
I'm just throwing out ideas. I'm still not sure I understand what you did with your changes :)
I have no more ideas. It should be noted that the conversion of nonce -> seed is always fixed, therefore, those who are important speed can use the lookup tables. The full table is 16GB, the optimized one is 8GB. Or rainbow tables.
We've come a log way. A simple benchmark over a 3 years period (--start 2017 --end 2014
) on my laptop:
version | -j 1 | -j 4 | PRNG |
---|---|---|---|
1.3.0 | 108 s 362 ms | - | old |
pre-1.4.0 | 76 s 534 ms | 29 s 818 ms | old optimized |
1.4.1 | 27 s 105 ms | 9 s 221 ms | lazy |
git | 8 s 909 ms | 3 s 109 ms | yura |
Obviously the difference is less noticeable on a faster machine (e.g. my Desktop).
There will be a new release soon. Thank you :)
Data used:
./pixiewps -e d0:14:1b:15:65:6e:96:b8:5f:ce:ad:2e:8e:76:33:0d:2b:1a:c1:57:6b:b0:26:e7:a3:28:c0:e1:ba:f8:cf:91:66:43:71:17:4c:08:ee:12:ec:92:b0:51:9c:54:87:9f:21:25:5b:e5:a8:77:0e:1f:a1:88:04:70:ef:42:3c:90:e3:4d:78:47:a6:fc:b4:92:45:63:d1:af:1d:b0:c4:81:ea:d9:85:2c:51:9b:f1:dd:42:9c:16:39:51:cf:69:18:1b:13:2a:ea:2a:36:84:ca:f3:5b:c5:4a:ca:1b:20:c8:8b:b3:b7:33:9f:f7:d5:6e:09:13:9d:77:f0:ac:58:07:90:97:93:82:51:db:be:75:e8:67:15:cc:6b:7c:0c:a9:45:fa:8d:d8:d6:61:be:b7:3b:41:40:32:79:8d:ad:ee:32:b5:dd:61:bf:10:5f:18:d8:92:17:76:0b:75:c5:d9:66:a5:a4:90:47:2c:eb:a9:e3:b4:22:4f:3d:89:fb:2b -r 89:90:3d:aa:c7:52:5a:3e:33:c5:b8:33:a3:1d:bf:a5:3c:95:d4:d4:6f:4c:1b:f5:3f:9b:e8:77:47:d8:72:7b:a9:46:85:f6:83:4b:6a:11:d0:6d:1d:5b:5f:af:70:32:02:4e:f9:d7:5c:bb:45:d8:03:eb:b2:ea:5d:c7:b2:a8:05:9d:f3:7e:93:be:97:c4:d6:60:11:cc:fc:9a:d3:21:f3:b9:4b:5b:79:ae:5d:31:66:3d:7c:41:42:31:df:5a:62:64:70:97:ae:7a:7a:57:43:13:71:01:cc:80:df:c0:7d:45:2e:e6:5d:92:8f:22:28:6c:53:f6:4f:5e:f8:aa:bf:21:4b:c6:9e:37:74:86:37:f2:07:6b:26:b9:7a:6f:50:86:6e:2a:80:3a:5a:e4:0f:d3:9e:50:12:4e:86:38:05:38:c6:34:1a:b7:8a:8a:13:f9:19:f9:61:c0:75:b7:2c:9a:1d:21:d5:5c:14:e7:50:94:c8:53:a8:4d:0f:06 -s d1:61:6f:47:df:93:d1:94:80:ec:a1:23:3c:7b:0b:1e:7a:b7:c4:09:a8:b8:20:87:cb:26:89:d4:d5:34:e3:21 -z 11:58:69:bd:7c:9d:3a:8c:18:e9:b1:04:bc:2e:85:a2:39:fb:12:cd:b3:a0:a5:c5:95:88:08:1d:fb:d7:92:a7 -a 77:e6:24:e3:4f:d1:8e:57:48:a2:88:de:ee:88:0e:14:bc:8e:9e:c8:4b:eb:14:ec:9e:8b:cb:d5:d6:6d:79:97 -n 4b:86:6f:2e:76:49:21:ce:56:c2:93:63:3c:b5:a0:b5
I found the way to improvement perfomance of the realtek prng.
and the example - how to use these functions:
It works 5 more times faster than the chain srandom_r() + random_r(). I do not tested it on big endian CPU.
Edit: added syntax coloring (wiire-a)