zawy12 / difficulty-algorithms


PID controller difficulty algorithm #20

Open zawy12 opened 6 years ago

zawy12 commented 6 years ago

[ 2022 Update: The error factor (and all the equations that follow) needed to be error = target/N * (0.5-e^(-t/T))

This tweet thread supersedes this article. https://twitter.com/zawy3/status/1491958678852886554 And updated image here: https://twitter.com/zawy3/status/1492150874348691459

image

The "I" controller is redundant to the filter on P=1/N & works for any process without needing to tune (you simply calculate CDF of your sensor variable from old data). All CDF's are uniform (linear) distributions. I found the best D controller to use is the gamma CDF for the prior 2 solvetimes minus the 2 solvetimes before that (and negated). Also, it turned out best in that case to use P=D=2.6/N where N is the filter that ASERT uses (the 2.6 gives it the same stability (Std Dev) as ASERT under constant hashrate. This image below specifies the best difficulty algorithm (a PD controller) that I've found, which is a specific implementation of the diagram above, keeping in mind the D part of the PD is using 2 pairs of solvetimes and putting them into the gamma CDF instead of the exponential CDF as indicated in the diagram. Using only the 2 previous solvetimes and exp CDF to get the slope seemed like it would be too unstable and experimentally this is what I saw. The gamma CDF is a better metric for more than 1 solvetime than trying to averaging two exponential CDF results, and experiment confirmed this statistical fact.

image

]

This derives a PID controller for difficulty and shows it is as good as the best (simpler) algorithms, LWMA and EMA.

Comparing PID to existing algorithms

A simple moving average seems like a PI controller in the sense that there is a sum of past terms (the "I" part). LWMA, EMA, and Digishield give more weight to the more recent blocks, which is like taking a slope into account, i.e. the "D" part of a PID controller.

Derivation of PID controller

Working directly in terms of difficulty gives bad results compared to working in terms of target. Target is the inverse of difficulty, i.e. target = MaxTarget / difficulty (which I sometimes call 1/D). A PID controller adds to (or subtracts from) the previous setting a weighted sum of 1) the previous error, 2) the avg (sum/N) of previous errors, and 3) the slope of the previous errors.

Each is weighted by a multiplier constant (I'll call them p, i, d) that is determined by theory and/or experiment. A PID controller is therefore:

PID
next_Target = prev_T + p*prev_Error + i*avg(prev_Errors) + d*slope(prev_Errors)

Notice the sum and slope are over some range(s). A wider window (larger N) will be more stable. To make p, i, and d consistent for changing window sizes, I found p and i need to be scaled by 1/N. This seems to keep p, i, and d in a -1 to 1 range. Notice the slope is already in "per block" terms, which may be why it does not need the 1/N scaling.

I will make a guess at a definition of the "error". The simplest I can think of is: error = target*(ST/T - 1) where ST = solvetime and T = target solvetime. So if ST is < T, this error is negative, which will lower next_Target, making it harder to solve, as we want. And vice versa. Putting the above together:

PID Controller

next_T, prev_T, target[i] = target values 
T = target solvetime
ST = solvetime.
p, i, and d are constants that give different weightings to each of the P-I-D terms. 

PID next_T = prev_T + p/N*prev_Error + i/N*avg(Errors) + d*slope(Errors)
where p, i, and d are the weighting factors for the adjustments.
Error[i] = target[i]*(ST[i]/T-1) for i=1 to N blocks in past.
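As a rough sketch of the update above (my own illustration; `targets` and `solvetimes` are assumed per-block lists, oldest first, and the slope is a simple least-squares fit):

```python
def pid_next_target(targets, solvetimes, T, N, p, i, d):
    # Error[k] = target[k] * (ST[k]/T - 1) over the last N blocks
    errors = [tg * (st / T - 1) for tg, st in zip(targets[-N:], solvetimes[-N:])]

    prev_error = errors[-1]
    avg_error = sum(errors) / len(errors)

    # least-squares slope of the errors, in "per block" units
    n = len(errors)
    x_mean = (n - 1) / 2
    num = sum((k - x_mean) * (e - avg_error) for k, e in enumerate(errors))
    den = sum((k - x_mean) ** 2 for k in range(n))
    slope = num / den if den else 0.0

    # next_Target = prev_T + p/N*prev_Error + i/N*avg(Errors) + d*slope(Errors)
    return targets[-1] + p / N * prev_error + i / N * avg_error + d * slope
```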

Obligatory control diagrams. image image

The above reasoning is not how I came across this. My route began with Jacob Eliosoff's suggestion of a certain type of EMA for difficulty. My testing of it showed it was perfectly correct for difficulty in some deep way: it works great and keeps perfect solvetime better than any other algorithm. Then Tom Harding (and Neil Booth) came up with a different type of EMA. I showed it was actually an approximation of Jacob's and was almost exactly the same. That made me think back to a book I had come across (image below) which said an EMA (not necessarily Jacob's or Tom's) could be used as a PID controller. So I started with Jacob's to get the PID like the book said, and used the simplification (see below). I forgot an old lesson and first worked it in difficulty instead of target. I got it to work in a tortuous way, then later corrected that mistake. Months later, when trying to simplify this article, I realized the end result could have been discovered by the above reasoning.

The above PID does not need the exact EMA, as long as it's not trying to be a super fast algorithm. The simplification uses e^x =~ 1+x for small x, where x = ST/T/N. As long as N > 10 or so, x is small.

I now see the EMA turns out to be close to a P controller:

EMA
next_target = N/(N-1)*prev_target + 1/N*prev_target*(ST/T-1)

except for the N/(N-1) factor, which makes a big (catastrophic) difference if it is removed from the EMA. Conversely, briefly trying to use the N/(N-1) factor in the PID controller seemed to make things worse, and doing so would go against the precise derivation below.

The rest of this article was written months ago and it shows my tortuous process.

Summary:

# This is targeted to replace Zcash's Digishield with equal stability and 5x fewer delays
# See bottom for complete algorithm.
p=1, i=-0.15, d=0.15, N=50, j=10, k=10 # but j=k=50 and p=0.5 works well.
# 1/N and i increase responsiveness to hashrate changes at a cost in stability. 
# d allows much faster response to *sudden* hashrate changes at a small cost in stability
# target = maxTarget/next_D but use target = 1/D here.
# next_target = 1/D + p*(errorP) + i*avg(errorI) + d*slope(errorD)*N
# use EMA error functions that keep solvetime accurate, e^x =~ 1+x for small x.
# t = ST = solvetime
errorP = (e^(t[-1]/T/N) - 1) * (1-T/t[-1]) * 1/prev_D =~ (t[-1]/T-1)/N/D[-1] 
errorI = (e^(tj/T/N) - 1) * (1-T/tj) * 1/Dj =~ (tj/T-1)/N/Dj 
errorD = (e^(tk/T/N) - 1) * (1-T/tk) * 1/Dk =~ (tk/T-1)/N/Dk
# where avg() and slope() are for each Dj or Dk and tj or tk over j or k windows.
# Possibly j = k =N is the best setting.
# Notice a slope is "per block" so you multiply by N, so the N's cancel.
# update: I set j and k to simple EMA's N=28 and it seems to work at least as well (0.5, -0.2, 0.2) 
# D[-1]=difficulty most recent block, T=target solvetime, t[-1] = most recent solvetime.
# When using the approximation, a slight adjustment like 0.99 to the T value in
# errorP is needed to get a more perfect solvetime.
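A small sketch of the error term above in Python, showing the exact exponential form next to the e^x =~ 1+x approximation (the function names and the example numbers are mine):

```python
import math

def error_exact(solvetime, difficulty, T, N):
    # (e^(t/T/N) - 1) * (1 - T/t) * 1/D
    return (math.exp(solvetime / T / N) - 1) * (1 - T / solvetime) / difficulty

def error_approx(solvetime, difficulty, T, N):
    # =~ (t/T - 1)/N/D
    return (solvetime / T - 1) / N / difficulty

# e.g. with T=600, N=50, a 300-second solvetime at difficulty 1000:
# both give roughly -1e-5 (the exact form is about -1.005e-5)
print(error_exact(300, 1000.0, 600, 50), error_approx(300, 1000.0, 600, 50))
```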

N in the PID2 title is really N/2.3 = 50. image

image

This has the most capability because you can fine-tune how it responds. But it's not absolutely clear that it is generally better than LWMA and the simple EMA.

Full derivation

PID controllers make an adjustment to a variable to achieve a desired goal based on an error signal that is the difference between where the system is and where we want it to be. The "P" of PID means it adjusts partly for the present error. The "I" means we adjust for the recent past average of the error (Integrative part). The "D" part means we use the slope (Derivative) to estimate the future based on the recent slope. Each part has a scaling factor that is tuned to get the desired results. They are then summed to determine the adjustment to the previous setting.
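For reference, the standard textbook (continuous-time) PID form, of which the per-block sums and slopes used below are a discrete analog:

$$u(t) = K_p\,e(t) + K_i\int_0^t e(\tau)\,d\tau + K_d\,\frac{de(t)}{dt}$$

where $e(t)$ is the error and $K_p$, $K_i$, $K_d$ are the tuning constants.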

Let

p0 = previous value
p1 = current value
e = p1 - p0 = error signal
a = alpha
D = difficulty
T = target solvetime
t = actual solvetime(s) 
N = mean lifetime in blocks (mean lifetime = half-life / ln(2)); 1/N = the extinction (rate) coefficient.

To make a PID controller for difficulty algorithms (see image of book way below), start with the basic EMA equation ema = a*p1+(1-a)*p0 and rearrange to get ema = p0+a*(p1-p0).
Although p1 might be our most recent reading, p0 was a forecast of what we expected it to be. Then p1-p0 is the error between our forecast and the actual value. We're adding an "alpha" fraction of that error to our previous forecast to get the next forecast. But we could also add a factor for the recent rate of change of the error in case it's getting bigger. We can add a 3rd factor if we see a recent history of our estimate underestimating the average error. This would then be a standard PID controller:

ema = p0+a1*e+a2*avg(ei)+a3*slope(ej)
where
a1, a2, a3 = the P-I-D constants
ei and ej are over some range i and j in the past
avg(ei) = sum(ei)/i  = integral part of PID controller
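A quick numeric check of the basic form (arbitrary numbers): with $p_0 = 100$, $p_1 = 120$, and $a = 0.25$,

$$\text{ema} = p_0 + a\,(p_1 - p_0) = 100 + 0.25\,(120 - 100) = 105,$$

i.e. the forecast moves a quarter of the way toward the new reading.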

BEGIN Old Method of Using Difficulty (can be ignored)

Jacob Eliosoff deduced difficulty algorithms need (D = most recent D, t = ST = solvetime): a = 1-e^(-t/T/N), p0 = D, p1 = D*T/t, which gives great performance and near-perfect solvetimes under all conditions when used as:

Jacob Eliosoff's EMA difficulty algorithm
next_D = D*(1 + a*(T/t - 1))
a=1-e^(-t/T/N) 
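A minimal sketch of one step of this EMA in Python (the function name, the 600-second target, and the example numbers are illustrative):

```python
import math

def ema_next_difficulty(prev_D, solvetime, T=600, N=50):
    # a = 1 - e^(-t/(T*N)); next_D = D * (1 + a*(T/t - 1))
    a = 1 - math.exp(-solvetime / (T * N))
    return prev_D * (1 + a * (T / solvetime - 1))

# a solvetime of half the target nudges difficulty up by about 1%
print(ema_next_difficulty(1000.0, 300))   # ~1009.95
```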

Plugging this into the full PID equation gives: next_D = D + a1*error + a2*avg(error_j) + a3*slope(error_k)

However, after much work in carefully defining these constants and errors in order to get them to work (and it worked well), I remembered I had promised myself to NEVER work in terms of difficulty, but ALWAYS in target. This is apparently because solvetimes vary a lot and are in the denominator of the difficulty versions, causing problems. I originally wanted to work in difficulty because it is proportional to energy (hashes), and I have no grander idea of why a 1/energy quantity like target should be better.

The tortuous way I got difficulty to work for the 2nd term was like this. It uses an avg of ST's to reduce the effect of their variations, which is not required in the target method.

a2 = i*(1-e^(-avg(ST)/T/N)) 
avg(errors) = avg(D)*( T/avg(ST) - 1) 

END Old Method of Using Difficulty

So, due to the problems with the difficulty method, instead of using Jacob Eliosoff's original EMA I used my reverse-engineered version of Tom Harding's simple EMA to get something like Jacob's EMA, which is apparently the most ideal EMA because it's the only one from which an ideal PID controller can be derived in a simple way. Jumping back to the beginning of the derivation (in the form needed to apply what the book said):

ema = p0 + a*(p1-p0)

Defining what my EMA uses (pT = previous target, T = target solvetime, t = ST = actual solvetime):

a = 1-e^(+t/T/N), p0 = pT, p1 = pT*T/t

Just to show how this is used in a difficulty algorithm:

An EMA for target
next_target = pT*(1+a*(T/t - 1))
a=1-e^(+t/T/N) 

Now the full PID controller for target (review what the beginning of this article said about why N is used like this):

PID controller for targets
( BTW next_D = maxTarget / next_Target )
PID next_Target = pT + p*prev_Error + i*avg(Errors) + d*slope(Errors)*N
Errors  = (e^(t[j]/T/N) - 1) * (1-T/t[j]) * target[j] =~ (t[j]/T-1)/N*target[j]
j = 1 to k blocks into the past where k can differ from N, but k=N may be best
p=0.5, i = -0.2, d=0.2  (these are decent settings) 

The simplified error is based on e^x =~ 1+x for small x, which is applicable for k>10 (20?).
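Spelling out that simplification, with $x = t_j/(TN)$ and $e^x - 1 \approx x$:

$$\Big(e^{\,t_j/(TN)} - 1\Big)\Big(1 - \frac{T}{t_j}\Big)\,\text{target}_j \;\approx\; \frac{t_j}{TN}\Big(1 - \frac{T}{t_j}\Big)\text{target}_j \;=\; \frac{t_j/T - 1}{N}\,\text{target}_j$$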

The standard EMA corresponds to p=1, i=0, and d=0, so that's the starting point. The main benefit seems to come from decreasing i below zero and making d larger than zero.

End derivation.

Selecting p, i, and d for speed without harming stability

When comparing algorithms, their N's should first be adjusted so they respond equally fast to step functions that simulate what live coins experience. I normalize all algorithms to the same response speed because they can all be made faster by decreasing N. The best algorithm is the one that rises as fast as the others while having the lowest standard deviation. More accidental variation causes more "hash attacks", which also causes more delays. I've compared all the previous algorithms here. After finding the best algorithm, the best value for N is chosen in order to minimize the function ~c/N^0.66 + d*N, where c and d are determined by the performance of existing coins.

The following charts show example settings. The first group demonstrates that they respond equally fast. [edit: The PID response does not look as good and shows I made a mistake in the required normalization. Excuse: the metric came out as good and I was not looking at the charts. It looks like the steps were not far enough apart to let it drop back to zero, so it had a head start on the others in rising.] The second group shows this algorithm has the best stability during constant hashrate (the Std Dev is printed on each chart). The third group is the most important. Those charts show how the algorithms respond to simulated miner profit motive, as defined in their titles. [edit: because of the error in the first step, the benefits shown in the 2nd and 3rd charts are overstated. The others could have had the same metric with a larger N. However, all of them would have had longer individual attacks and delays. The metric would come out the same because they would have had fewer attacks and delays. This shows the benefit of the PID: you can tune it to give twice as many delays that are 1/2 as bad, which may be better than less frequent but longer delays.]

The metrics are: % delays = % of blocks that were part of an 11-block average whose solvetimes were > 2xT, and "blocks stolen" = number of blocks where hashrate was > 2x what the difficulty was set for. Low values of these correspond to a lower Std Dev. The charts are ranked best to worst. The values may differ between runs.
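To make the metric definitions concrete, here is one possible reading in Python (this interpretation of the windowing, and the assumed inputs, are mine, not from the testing spreadsheet):

```python
def pct_delays(solvetimes, T):
    # One reading of "% delays": percent of blocks that fall inside at least one
    # 11-block window whose average solvetime is > 2*T.
    n = len(solvetimes)
    delayed = set()
    for start in range(n - 10):
        if sum(solvetimes[start:start + 11]) / 11 > 2 * T:
            delayed.update(range(start, start + 11))
    return 100 * len(delayed) / n

def blocks_stolen(hashrates, baseline_hashrates):
    # "Blocks stolen": blocks where the actual hashrate exceeded 2x the hashrate
    # the difficulty was set for (baseline_hashrates is an assumed input).
    return sum(1 for h, b in zip(hashrates, baseline_hashrates) if h > 2 * b)
```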

It has thinner red bars in the last group of charts which shows miners will not be able to get "cheap blocks" for long before needing to switch coins. It has 3x fewer delays than Digishield (Zcash) while issuing fewer cheap blocks. But it seems the algorithms are comparable. The SMA is the only one clearly behind.

NOTE: The N for the PID and EMA are 2.3x higher than the N in the equations above so that the EMA responds at the same speed as the LWMA for a given N. Wherever N occurs in the math above, I use N/2.3.

These algorithms are described here.

Charts: ema_90_step, ema_90_stable, ema_90_attack, pid

Why PIDs are marginally useful for difficulty

The reason PID controllers are not better here is that they're meant for systems that have inertia of some kind, like a mass on a spring that gets disturbed away from a desired set point. Since hashrate usually jumps 2x to 3x in a few blocks when difficulty accidentally drops 25% below target, there's no mass on a spring here. We have something like a tangent function, because hashrate can drop to zero if the difficulty/price ratio gets too high, and rise to basically infinity for small coins if it gets too low. Something like the tangent function describes miner motivation, which is ultimately the variable we want to control. A tangent function is a difficult mathematical starting point if you want to design a controller; it would not be a PID controller, and I wouldn't be able to tune parameters based on the normal methods. The algorithm does not have access to the coin's current price, which would be a valuable metric for helping to set difficulty. Unlike other processes, the thing being controlled can see and think about the controller's logic and change its behavior, throwing off any exploitable assumption the controller made.

PID difficulty algorithm

next_D = 1 / [ 1/D + p/N*(1/D*(t/T-1)) + i/N*avg(1/Dj*(tj/T-1)) + d*slope(1/Dk*(tk/T-1)) ]

where p, i, and d are the -1 to 1 parameters for tuning it, and j and k are the sizes of the windows for avg() and slope().

Now I can use this to write an accurate algorithm (I have not run this code; the testing above was only in a spreadsheet):

# PID difficulty algorithm
# Based on next_D = 1 / [ 1/prev_D + p*prevError + i*avgErrors + d*slopeErrors ]
# see https://github.com/zawy12/difficulty-algorithms/issues/20

# Set parameters
# Controller weighting factors (not the ideal ones because this algo is different from testing)
p=1, i=-0.15, d=0.15
# (P)roportional slowness
N=50 # (same as N = 2.3*50 = 115 in testing spreadsheet)
# (I)ntegration sampling width to calculate recent avg of Error
Ni = 12
# (D)erivative sampling width to calculate recent slope of Error
Nd = 12
# TS[] = timestamps, D[] = difficulties, index -1 = most recent block
# Solvetimes are limited to the range -(6-1)*T to (6+1)*T, i.e. -5T to 7T

# integrate recent past alpha*error, using the e^x = 1+x approximation
avgAlphaError = 0
for (c=-Ni; c<0 ; c++) { 
    ST = min((6+1)*T, max(-(6-1)*T, TS[c]-TS[c-1] ))
    avgAlphaError += 1/D[c]*(ST/T - 1)/Ni
}
# get the average error per block over the slope window, for use in the
# least-squares slope of the alpha*error
avgWindowAlphaError = 0
for (c=-Nd; c<0 ; c++) { 
    ST = min((6+1)*T, max(-(6-1)*T, TS[c]-TS[c-1] ))
    avgWindowAlphaError += 1/D[c]*(ST/T-1)/Nd    
}
# calculate the least-squares slope of the alpha*error
avgNd = (Nd+1)/2
j = 0   # block counter within the slope window
slopeNumerator = 0, slopeDenominator = 0
for (c=-Nd; c<0 ; c++) { 
    j++
    ST = min((6+1)*T, max(-(6-1)*T, TS[c]-TS[c-1] )) 
    error = 1/D[c]*(ST/T-1)
    slopeNumerator += (j-avgNd)*(error-avgWindowAlphaError)
    slopeDenominator += (j-avgNd)^2
}
slopeAlphaError = slopeNumerator/slopeDenominator

ST = min((6+1)*T, max(-(6-1)*T, TS[-1]-TS[-2] )) 
previousAlphaError = 1/D[-1]*(ST/T-1)

next_D = 1/( 1/D[-1] + p/N*previousAlphaError + i/N*avgAlphaError + d*slopeAlphaError )
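For what it's worth, a rough and equally untested Python translation of the pseudocode above (T, the target solvetime, must be supplied; the lists are assumed oldest-first with the most recent block at index -1):

```python
def pid_next_difficulty(TS, D, T, N=50, Ni=12, Nd=12, p=1.0, i=-0.15, d=0.15):
    # TS = timestamps, D = difficulties; needs at least max(Ni, Nd) + 1 blocks.
    def st(idx):
        # limit solvetimes to the -5*T to 7*T range used above
        return min(7 * T, max(-5 * T, TS[idx] - TS[idx - 1]))

    # P term: most recent per-block error in target (1/D) units
    prev_error = (st(-1) / T - 1) / D[-1]

    # I term: average error over the last Ni blocks
    avg_error = sum((st(-c) / T - 1) / D[-c] for c in range(1, Ni + 1)) / Ni

    # D term: least-squares slope of the error over the last Nd blocks (oldest..newest)
    errors = [(st(c) / T - 1) / D[c] for c in range(-Nd, 0)]
    x_mean = (Nd + 1) / 2
    y_mean = sum(errors) / Nd
    num = sum((k + 1 - x_mean) * (e - y_mean) for k, e in enumerate(errors))
    den = sum((k + 1 - x_mean) ** 2 for k in range(Nd))
    slope_error = num / den

    return 1 / (1 / D[-1] + p / N * prev_error + i / N * avg_error + d * slope_error)
```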