Is there any particular reason to scale the existing capacity in exact steps of $1.5c+8$? If the only purpose of the loop is to prevent reallocation with an overly small increment, one could simply do this instead:
```zig
const better_capacity = @max(new_capacity, self.capacity +| (self.capacity / 2 + 8))
```
You could also just pregenerate the entire series at compile time
> Is there any particular reason to scale the existing capacity in exact steps of $1.5c+8$? If the only purpose of the loop is to prevent reallocation with an overly small increment, one could simply do this instead:
>
> `const better_capacity = @max(new_capacity, self.capacity +| (self.capacity / 2 + 8))`
This behaves differently for resizing from e.g. `0` to `x`.
Here's the current pattern from calling `ensureTotalCapacity` with current capacity of 0:
```
requested capacity => resulting capacity
0...8 => 8
9...20 => 20
21...38 => 38
39...65 => 65
66...105 => 105
106...165 => 165
166...255 => 255
256...390 => 390
391...593 => 593
594...897 => 897
898...1353 => 1353
1354...2037 => 2037
2038...3063 => 3063
3064...4602 => 4602
4603...6911 => 6911
6912...10374 => 10374
10375...15569 => 15569
15570...23361 => 23361
23362...35049 => 35049
35050...52581 => 52581
52582...78879 => 78879
...
```
The code in the OP is fairly similar:
```
requested capacity => resulting capacity
0...7 => 8
8...19 => 20
20...35 => 40
36...59 => 60
60...91 => 100
92...139 => 140
140...203 => 220
204...299 => 300
300...427 => 460
428...619 => 620
620...875 => 940
876...1259 => 1260
1260...1771 => 1900
1772...2539 => 2540
2540...3563 => 3820
3564...5099 => 5100
5100...7147 => 7660
7148...10219 => 10220
10220...14315 => 15340
14316...20459 => 20460
20460...28651 => 30700
28652...40939 => 40940
40940...57323 => 61420
57324...81899 => 81900
...
```
But with `@max(new_capacity, self.capacity +| (self.capacity / 2 + 8))`:
```
requested capacity => resulting capacity
0...8 => 8
9...10 => 10
11...12 => 12
13...14 => 14
15...16 => 16
17...18 => 18
19...20 => 20
21...22 => 22
23...24 => 24
25...26 => 26
...
78319...78320 => 78320
78321...78322 => 78322
78323...78324 => 78324
...
```
> You could also just pregenerate the entire series at compile time
As I understand it, this would involve an impossibly large number of possibilities, as `better_capacity` depends on both the `self.capacity` and the `new_capacity`. For example, here's how the current `better_capacity` behaves:
```
self.capacity -> new_capacity = better_capacity
0 -> 101 = 105
50 -> 101 = 132
100 -> 101 = 158
```
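To spell out the middle row: starting from `self.capacity = 50` and requesting 101, the loop first computes 50 + (25 + 8) = 83, which is still below 101, and then 83 + (41 + 8) = 132, so `better_capacity` ends up as 132.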
Here is the godbolt link: https://zig.godbolt.org/z/fTeaY1Gfn
Here is a comparison of before and after:
The existing (branching) version:

```
ensureTotalCapacity:
cmp x0, x1
b.hs .LBB0_2
.LBB0_1:
lsr x8, x0, #1
add x8, x8, #8
adds x8, x0, x8
csinv x0, x8, xzr, lo
cmp x0, x1
b.lo .LBB0_1
.LBB0_2:
ret
```
The branchless version:

```
ensureTotalCapacity:
adds x8, x1, #20
mov w9, #63
csinv x8, x8, xzr, lo
adds x10, x0, #20
clz x11, x8
csinv x10, x10, xzr, lo
sub w12, w9, w11
clz x14, x10
sub w9, w9, w14
mov w13, #1
mov w15, #59
sub w11, w15, w11
sub w14, w15, w14
and x15, x12, #0x3f
lsl x12, x13, x12
eor x8, x12, x8
lsl x12, x13, x9
and x9, x9, #0x3f
eor x12, x12, x10
lsr x8, x8, x11
lsr x11, x12, x14
orr x8, x8, x15, lsl #4
orr x9, x11, x9, lsl #4
subs x8, x8, x9
csel x8, xzr, x8, lo
lsr x8, x8, #3
add x8, x8, #1
ubfx x9, x8, #1, #6
lsr x11, x8, #1
subs w9, w9, #1
and x8, x8, #0x1
csel w9, wzr, w9, lo
lsl x11, x13, x11
lsl x8, x8, x9
lsr x9, x0, #1
orr x8, x8, x11
add x9, x9, #8
adds x9, x0, x9
umulh x11, x10, x8
csinv x9, x9, xzr, lo
mul x8, x10, x8
cmp xzr, x11
csinv x8, x8, xzr, eq
sub x8, x8, #20
cmp x9, x8
csel x0, x9, x8, hi
ret
```
The branching code already wins in terms of code size, as well as source code simplicity.
I suspect if you measured this, the existing machine code would win in terms of perf as well due to the branch being extremely predictable and the fact that it rarely branches backwards.
Also, the exact equation (`n / 2 + 8`) is not important. What's important is that it increases super-linearly so that appending is amortized O(1), balanced against not increasing too fast, which would waste memory.
Cool math tho.
I noticed this code in `ArrayList.ensureTotalCapacity`: https://github.com/ziglang/zig/blob/91b4729962ddec96d1ee60d742326da828dae94a/lib/std/array_list.zig#L363-L380

Specifically:

https://github.com/ziglang/zig/blob/91b4729962ddec96d1ee60d742326da828dae94a/lib/std/array_list.zig#L374-L377
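For reference, the loop at those lines looks roughly like the following (paraphrased into a standalone helper for illustration; see the links above for the exact code):

```zig
// Paraphrase (not a verbatim copy) of the referenced growth loop: keep growing
// by half the current capacity plus 8, saturating, until the request fits.
fn growCapacity(current: usize, requested: usize) usize {
    var better_capacity = current;
    while (true) {
        better_capacity +|= better_capacity / 2 + 8;
        if (better_capacity >= requested) break;
    }
    return better_capacity;
}
```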
I hope this could somehow be removed in favor of a branchless option, so what follows is an investigation into this possibility, just for fun.
First, I will disregard the fact that we are dealing with 64-bit integers, the fact that integer division by 2 floors the true quotient by 0.5 when the dividend is odd, and the saturating arithmetic, and just write this in terms of a recursive sequence.
$$\large \begin{equation} \begin{split}
U_0 &= \texttt{capacity} \\
U_n &= U_{n-1} \times 1.5 + 8 \\
\end{split} \end{equation}$$
If you remember your pre-calculus class, this recursive sequence is called "shifted geometric", because it has a multiply that is being shifted by an addition. For $\large U_0 = c$, the expansion of this recursive sequence looks like:
$$\large \begin{equation} \begin{split}
U_1 = \nobreakspace &&&&&c \times 1.5 + 8 \\
U_2 = \nobreakspace &&&&(&c \times 1.5 + 8) \times 1.5 + 8 \\
U_3 = \nobreakspace &&&(&(&c \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8 \\
U_4 = \nobreakspace &&(&(&(&c \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8 \\
U_5 = \nobreakspace &(&(&(&(&c \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8) \times 1.5 + 8 \\
\end{split} \end{equation}$$
To get the general equation, let's replace $\large 1.5$ with $\large r$ and $\large 8$ with $\large d$
$$\large \begin{equation} \begin{split}
U_0 = \nobreakspace &&&&&c \\
U_1 = \nobreakspace &&&&&c \times r + d \\
U_2 = \nobreakspace &&&&(&c \times r + d) \times r + d \\
U_3 = \nobreakspace &&&(&(&c \times r + d) \times r + d) \times r + d \\
U_4 = \nobreakspace &&(&(&(&c \times r + d) \times r + d) \times r + d) \times r + d \\
U_5 = \nobreakspace &(&(&(&(&c \times r + d) \times r + d) \times r + d) \times r + d) \times r + d \\
\end{split} \end{equation}$$
Let's apply the distributive property of multiplication:
$$\large \begin{equation} \begin{split}
U_1 &= cr^1 &&&&+ dr^0 \\
U_2 &= cr^2 &&&+ dr^1 &+ dr^0 \\
U_3 &= cr^3 &&+ dr^2 &+ dr^1 &+ dr^0 \\
U_4 &= cr^4 &+ dr^3 &+ dr^2 &+ dr^1 &+ dr^0 \\
U_5 &= cr^5 + dr^4 &+ dr^3 &+ dr^2 &+ dr^1 &+ dr^0 \\
\end{split} \end{equation}$$
The pattern here is pretty obvious. We can express it using $\large \Sigma$ notation:
$$\large U_n = cr^n + \sum_{i=1}^{n} dr^{i-1}$$
You may notice that the $\large \Sigma$ term is the "sum of a finite geometric sequence". Replacing that term with the well-known formula for that allows us to write an explicit function:
$$\large f(n) = cr^n + d \left(\frac{1 - r^n}{1 - r}\right)$$
Let's put $\large 1.5$ back in for $\large r$ and $\large 8$ back in for $\large d$ and assess the damage:
$$\large f(n) = c \times 1.5^n + 8 \left(\frac{1 - 1.5^n}{1 - 1.5}\right)$$
Luckily, we can simplify $\large (1 - 1.5)$ to $\large -0.5$. Dividing by $\large -0.5$ is equivalent to multiplying by $\large -2$, which we can combine with the $\large 8$ term to get $\large -16$:
$$\large f(n) = c \times 1.5^n + -16 (1 - 1.5^n)$$
We could stop here, but let's distribute the $\large -16$:
$$\large f(n) = c \times 1.5^n - 16 + 16 \times 1.5^n$$
Since we have two terms being added which each are multiplied by $\large 1.5^n$, we factor it out like so:
$$\large f(n) = (c+16) \times 1.5^n - 16$$
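As a quick sanity check (my addition, not part of the original derivation), plugging in $\large c = 0$ reproduces the 8, 20, 38, 65, ... capacities that the integer loop produces when growing from zero:

$$\large \begin{equation} \begin{split}
f(1) &= 16 \times 1.5 - 16 = 8 \\
f(2) &= 16 \times 2.25 - 16 = 20 \\
f(3) &= 16 \times 3.375 - 16 = 38 \\
f(4) &= 16 \times 5.0625 - 16 = 65 \\
\end{split} \end{equation}$$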
This looks how we probably expected it would, and it is relatively easy to deal with. Now let's try to apply this to our original problem. The first thing we want to do is find an $\large n$ for which $\large x \ge f(n)$, where $\large x$ is the requested `new_capacity`. To find $\large n$, we have to isolate it on the right-hand side:

$$\large \begin{equation} \begin{split}
x &\ge (c+16) \times 1.5^n - 16 \\
& \small \texttt{(+16 to both sides)} \\
x + 16 &\ge (c+16) \times 1.5^n \\
& \small \texttt{(divide by (c+16) on both sides)} \\
\frac{x + 16}{c+16} &\ge 1.5^n \\
& \small \texttt{(take the log of both sides)} \\
\log{\left(\frac{x + 16}{c+16}\right)} &\ge \log{(1.5^n)} \\
& \small \texttt{(property of logarithms on the right-hand side)} \\
\log{\left(\frac{x + 16}{c+16}\right)} &\ge n\log{(1.5)} \\
& \small \texttt{(divide each side by log(1.5))} \\
\frac{\log{\left(\frac{x + 16}{c+16}\right)}}{\log{(1.5)}} &\ge n \\
& \small \texttt{(change of base on the left-hand side)} \\
\log_{1.5}{\left(\frac{x + 16}{c+16}\right)} &\ge n \\
& \small \texttt{(quotient rule for logarithms on the left-hand side)} \\
\log_{1.5}{(x + 16)} - \log_{1.5}{(c + 16)} &\ge n \\
\end{split} \end{equation}$$
Now this is usable for our problem. We can compute $\large n$ by doing $\large \lceil\log_{1.5}{(x + 16)} - \log_{1.5}{(c + 16)}\rceil$, then plug that in to $\large n$ in $\large f(n) = (c+16) \times 1.5^n - 16$. Together, that's:

$$\large (c+16) \times 1.5^{\lceil(\log_{1.5}{(x + 16)} - \log_{1.5}{(c + 16)})\rceil} - 16$$
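As a quick worked example (my own check, not in the original post): with $\large c = 0$ and $\large x = 101$, we get $\large n = \lceil\log_{1.5}{(117)} - \log_{1.5}{(16)}\rceil = \lceil 4.907...\rceil = 5$, and $\large f(5) = 16 \times 1.5^5 - 16 = 105.5$, while the integer loop starting from 0 lands on 105 after the same five steps (8, 20, 38, 65, 105).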
For those of you who skipped ahead, $\large c$ is `self.capacity` and $\large x$ is `new_capacity`, and this formula gives you the `better_capacity`. Note that this formula will give numbers a bit higher than the original while loop, because the original while loop loses some 0.5's when dividing an odd number by 2.

Now, the remaining question is how to compute the previous expression, or rather, an approximation of it, efficiently.
Sadly, there is no cheap way to compute the base 1.5 logarithm of an integer. If we were allowed to change the original problem such that we could use the base 2 logarithm, that would be much easier to compute: it's just `@typeInfo(@TypeOf(c)).Int.bits - 1 - @clz(c)` (obviously, this yields an integer, so we should be careful about how the flooring of the true answer affects rounding error). Let's use this information to make an approximation. Using the change of base property of logarithms, we can rewrite the equation like so:

$$\large \frac{\log_2{(x + 16)}}{\log_2{1.5}} - \frac{\log_2{(c + 16)}}{\log_2{1.5}}$$
$\large \frac{1}{\log_2{1.5}} \approx 1.7095112913514547$, so we can approximate the above expression like so:
$$\large (\log_2{(x + 16)} - \log_2{(c + 16)}) \times 1.7095112913514547$$
As hinted at earlier, we can find $\large \lceil\log_2{(x + 16)}\rceil - \lceil\log_2{(c + 16)}\rceil$ by doing `@clz(c + 15) - @clz(x + 15)`. Note that the terms are now in reverse order because the answer returned by `@clz(b)` is actually $\large 63 - \lfloor\log_2{b}\rfloor$. We also subtracted 1 from 16 because we probably want the ceil base 2 logarithm instead, and the algorithm for that is `64 - @clz(x - 1)`. `(64 - @clz((x + 16) - 1)) - (64 - @clz((c + 16) - 1))` reduces to `@clz(c + 15) - @clz(x + 15)`. That's slightly different than what we want, which is to ceil only after multiplying by $\large 1.7095112913514547$, but if we're careful about which way the rounding works, we should be fine.

The other thing I notice is that $\large 1.5^{n}$ is equivalent to $\large \frac{3^{n}}{2^{n}}$. Of course, dividing by $\large 2^{n}$ is just a right shift, which means we could do the following once we determine the value of $\large n$:
$$\large (((c+16) \times 3^{n}) \gg n) - 16$$
Of course, this will have additional overflow potential even when the right shift would have taken us back into the range of `usize`. Maybe we could expand to 128 bits for the multiply. Alternatively, for powers of 1.5 where the decimal point is less relevant, we'd probably be fine with a lookup table or something, so our code could be `((c + 16) * powers[...]) - 16`.
One thing we could do is work backwards, changing $\large 1.7095112913514547$ to a nicer number like $\large 1.5$ or $\large 2$. Let's pick $\large 2$. To make it so we would multiply by $\large 2$ instead, we would change our recursive sequence to:
$$\large \begin{equation} \begin{split}
U_0 &= \texttt{capacity} \\
U_n &= U_{n-1} \times \sqrt 2 + 8 \\
\end{split} \end{equation}$$
This works because $\large \frac{1}{\log_2{\sqrt 2}}$ is $\large 2$. This is still pretty close to our original formula, as $\large \sqrt 2 \approx 1.41421$ and $\large 1.41421 \approx 1.5$. If we did the same steps as before, $\large \frac{8}{1 - \sqrt 2} \approx 19.313708498984756$ would be in all the places where we had $\large 16$ in our original equations. Let's round that up to $\large 20$ this time, since we rounded $\large 1.5$ down to $\large \sqrt 2$. To do that, we change the common difference of $\large 8$ to $\large -20 (1 - \sqrt 2)$, which is about $\large 8.2842712474619$. Reminder: the point here is that when we divide this value by $\large (1 - \sqrt 2)$, we get $\large -20$ rather than the $\large -16$ we had earlier.
$$\large \begin{equation} \begin{split}
U_0 &= \texttt{capacity} \\
U_n &= U_{n-1} \times \sqrt 2 - 20 (1 - \sqrt 2) \\
U_n &\approx U_{n-1} \times 1.41421 + 8.2842712474619 \\
\end{split} \end{equation}$$
By the same steps shown above, this gives us the coveted:
$$\large (c+20) \times \sqrt 2^{\lceil 2(\log_2{(x + 20)} - \log_2{(c + 20)})\rceil} - 20$$
I.e.:
$$\large (c+20) \times \sqrt 2^{\lceil \log_{\sqrt 2}{(x + 20)} - \log_{\sqrt 2}{(c + 20)}\rceil} - 20$$
As mentioned before, we can find $\large \lceil\log_2{(x + 20)}\rceil - \lceil\log_2{(c + 20)}\rceil$ by doing `@clz(c + 19) - @clz(x + 19)`. However, this is not close enough to $\large \lceil \log_{\sqrt 2}{(x + 20)} - \log_{\sqrt 2}{(c + 20)}\rceil$ for our use-case, because we need at least the granularity of a $\large \log_{\sqrt 2}$ either way (ideally, we could use even more precision in some cases). This could be accomplished via a lookup table, or via another approximation. As an approximation, we could pretend that each odd power of $\large \sqrt 2$ is half-way between the powers of $\large 2$ that fall on even powers of $\large \sqrt 2$. If you think about it, this is kind of semantically in line with what we are doing when we subtract the `@clz` of two numbers, now with slightly more granularity. By AND'ing the bit directly under the most significant bit with the most significant bit, then moving it to the 1's place, we can add it (or OR it) with double the bit index of the highest set bit:
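The snippet from the original post is not reproduced here, but a sketch of the described trick might look like this (the helper name and exact form are my own):

```zig
const std = @import("std");

// Sketch: 2 * floor(log2(v)), with the bit directly under the most significant
// bit OR'd into the 1's place as a half-step of interpolation between powers
// of 2. Assumes v >= 2, which holds here since we only call this on values
// that have had 19 added to them.
fn approxTwiceLog2(v: u64) u64 {
    const msb = 63 - @clz(v); // index of the highest set bit, i.e. floor(log2(v))
    const under_msb = std.math.shr(u64, v, msb - 1) & 1; // the bit directly under the MSB
    return (@as(u64, msb) * 2) | under_msb;
}
```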
This is kind of what we are looking for, with a bit more accuracy than before. We can also scale this up even more if desired, by multiplying by 4, moving the bit directly under the MSB to the 2's place, and moving the bit two positions below the MSB to the one's place. However, because we are taking the `clz` of values that have 19 added to them, we can guarantee there will always be at least 4 bits under the most significant bit that we can use to interpolate between powers of $\large \sqrt 2$, so we can scale to 4 extra bits to improve accuracy:
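Again, the original snippet is not captured here; a sketch of the 4-bit version (my reconstruction, hypothetical name) could be:

```zig
const std = @import("std");

// Sketch: 16 * floor(log2(v)), plus the 4 bits directly below the MSB as a
// fractional interpolation between powers of 2. Safe because v always has 19
// added to it here, so the MSB index is at least 4.
fn approxLog2Scaled16(v: u64) u64 {
    const msb = 63 - @clz(v); // floor(log2(v))
    const frac = std.math.shr(u64, v, msb - 4) & 0xF; // the 4 bits under the MSB
    return (@as(u64, msb) * 16) | frac;
}
```

This appears consistent with the `lsl #4` / `orr` pattern visible in the branchless assembly earlier in the thread.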
Now that we have calculated $\large n$, the last problem is approximating $\large \sqrt 2^n$. Again, this can be done with a lookup table, or we could pretend once more that odd powers of $\large \sqrt 2$ are directly in the middle of powers of $\large 2$. Let's try that.
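The post's snippet is not reproduced here either, but judging by the estimates listed below, the approximation behaves like this sketch (the name is mine): even powers of $\large \sqrt 2$ are exact powers of 2, and odd powers are pretended to sit half-way between, i.e. $\large 1.5 \times 2^{(n-1)/2}$ (for example, $\large n = 31$ gives $\large 1.5 \times 2^{15} = 49152$).

```zig
const std = @import("std");

// Sketch of the approximation described above: 2^floor(n/2), plus half a step
// more when n is odd.
fn approxSqrt2Pow(n: u64) u64 {
    const base = std.math.shl(u64, 1, n / 2); // 2^floor(n/2)
    return if ((n & 1) == 0) base else base + base / 2;
}
```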
And here are the estimates versus what we would get from `std.math.pow(f64, std.math.sqrt2, n)`:
```
√2^31: 49152 vs 46340.950011841735
√2^33: 98304 vs 92681.9000236835
√2^35: 196608 vs 185363.80004736703
√2^37: 393216 vs 370727.60009473417
√2^39: 786432 vs 741455.2001894685
√2^41: 1572864 vs 1482910.4003789374
√2^43: 3145728 vs 2965820.800757875
√2^45: 6291456 vs 5931641.601515752
√2^47: 12582912 vs 11863283.203031506
√2^49: 25165824 vs 23726566.406063017
√2^51: 50331648 vs 47453132.81212604
√2^53: 100663296 vs 94906265.62425211
√2^55: 201326592 vs 189812531.24850425
√2^57: 402653184 vs 379625062.4970086
√2^59: 805306368 vs 759250124.9940174
√2^61: 1610612736 vs 1518500249.9880352
√2^63: 3221225472 vs 3037000499.976071
√2^65: 6442450944 vs 6074000999.952143
√2^67: 12884901888 vs 12148001999.904287
√2^69: 25769803776 vs 24296003999.808582
√2^71: 51539607552 vs 48592007999.61717
√2^73: 103079215104 vs 97184015999.23438
√2^75: 206158430208 vs 194368031998.46878
√2^77: 412316860416 vs 388736063996.9377
√2^79: 824633720832 vs 777472127993.8755
√2^81: 1649267441664 vs 1554944255987.7512
√2^83: 3298534883328 vs 3109888511975.503
√2^85: 6597069766656 vs 6219777023951.008
√2^87: 13194139533312 vs 12439554047902.018
√2^89: 26388279066624 vs 24879108095804.043
√2^91: 52776558133248 vs 49758216191608.09
√2^93: 105553116266496 vs 99516432383216.22
√2^95: 211106232532992 vs 199032864766432.47
√2^97: 422212465065984 vs 398065729532865.06
√2^99: 844424930131968 vs 796131459065730.2
√2^101: 1688849860263936 vs 1592262918131461
√2^103: 3377699720527872 vs 3184525836262922.5
√2^105: 6755399441055744 vs 6369051672525847
√2^107: 13510798882111488 vs 12738103345051696
√2^109: 27021597764222976 vs 25476206690103400
√2^111: 54043195528445952 vs 50952413380206810
√2^113: 108086391056891904 vs 101904826760413630
√2^115: 216172782113783808 vs 203809653520827300
√2^117: 432345564227567616 vs 407619307041654700
√2^119: 864691128455135232 vs 815238614083309600
√2^121: 1729382256910270464 vs 1630477228166619600
√2^123: 3458764513820540928 vs 3260954456333240000
√2^125: 6917529027641081856 vs 6521908912666482000
√2^127: 13835058055282163712 vs 13043817825332965000
```

Well, that's pretty much all I have so far. With a little polishing, this is the code I ended up with:
Here is the godbolt link: https://zig.godbolt.org/z/fTeaY1Gfn
I would be interested to hear if anyone else has any ideas on how to improve this (beyond optimizations already performed automagically by LLVM). I would also be happy if someone volunteered to benchmark this, either in a microbenchmark or in a large system that makes heavy or light use of this function. Thanks for reading!
- Validark