zawy12 / difficulty-algorithms

See the Issues for difficulty algorithms
MIT License
107 stars 25 forks source link

LWMA's history #24

Open zawy12 opened 6 years ago

zawy12 commented 6 years ago

subtitle: My White Whale

Pre-LWMA In 2016 I worked on trying to improve Zcash's difficulty algorithm with many different ideas. Nothing came of it except for wasting @str4d and @daria's time. During this time I posted an errant issue to Monero recommending a simple moving average with N=17. It was errant because N=17 was much too fast which also means it randomly varies a lot. It attracts constant on-off mining by dropping 25% too low on accident all day, everyday. Karbo coin adopted it to stop terrible delays resulting from the 24-hour averaging of the default cryptonote algorithm. It saved their coin, they were grateful, and other coins noticed. I found out about Karbo 6 months later when Sumo had to fork for the same reason, adopted it with minor modifications, and contacted me. Karbo had asked for help earlier, but I was busy and didn't respond. Then more CN/Monero/Forknote/Bytecoin clones noticed and copied either Sumo and Karbo. It was known as Zawy v1 but it was no more than a simple moving average. At the time we were not aware that it was not as good as the Digishield Zcash adopted.

Karbo kept me interested in difficulty algorithms in early 2017, testing every idea I had, showing interest, and providing a bouncing board. I was unable to find anything better than a simple moving average, but I learned how to identify good and bad algorithms. BCH (after their EDA disaster) used a good simple moving average seemingly based on my inability to find something better and may have called it my idea, but it's just BTC's algo converted to a rolling average and a much smaller N. They chose N=144 instead of my N=50 and it has worked good for them. They used a symmetrical limit I promoted and devised an interesting median of 3 method (@deadalnix?) for ignoring single bad timestamps at a cost of a 1-block delay.

LWMA discovery In July 2017, @dgenr8 showed me his WT-144 being tested for BCH that used an LWMA in difficulty. My Excel spreadsheet "difficulty algorithm platform" immediately declared it a winner. I had tried it in 2016 and 2017 on difficulty/solvetime values to get next_difficulty and it didn't work. He applied the LWMA function to solvetime*target values to get next_Target. I see his sound logic in doing it on those values (each target corresponds to each solvetime, so keep them together), but in trying to understand why his nextTarget = LWMA(STs*targets) / T (ST = solvetime, T=target solvetime) was so much better than my old testing of nextDifficulty = LWMA(difficulties/STs) * T I thought a lot about averaging and came to believe the following should be better: nextTarget = avgTarget*LWMA(STs) / T which is the same as nextDifficulty = harmonicMeanDifficulties/LWMA(STs) * T Testing revealed this to be just enough better (not greatly, but measurably) that I abandoned WT-144.

Dgenr8's method made me realize simple moving averages like this next_Target = average(ST)\*avg(target)/T work better and are more precise than next_Difficulty = avg(D) * T / avg(ST) * 0.95 Intuitively speaking it should have been more clear because the STs vary a lot, hitting a lot of solvetimes below the average and small values in denominator cause big changes. The 0.95 is a correction factor for N=30 to get the correct solvetime and it gets small as N increases, nearing 0.99 at N=100. The effect is more clearly seen in a different way. This: next_D = T*avg(D/ST) gives terrible results but this next_D = T/avg(ST/D) gives perfect results. The first is avg(1/X) while the 2nd is the correct 1/avg(X). BTW, 1/avg(1/D) = harmonicMean(D).

EMA Discovery @jacob-eliosoff noticed our work on trying to get BCH a new DA and suggested a specific type of EMA. He seemed to derive it from past experience and intuition in an evening or two and maybe solidified his thinking from a paragraph in Wikipedia's moving average article that seems to have been there only by chance. It always gives perfect solvetimes with any N, so it's the best theoretical starting point. Dgenr8 suggested I try to combine it with an LWMA but I didn't try it because I could see it looked already to be about the same thing, just exponential instead of linear. Dgenr8 then did it himself, and it came out to be almost exactly the same (to our surprise) but his version could handle zero and negatives. He had basically inverted it and come up with something that has deeper theoretical significance if not a little more accuracy. It was yet again showing working in terms of target is mathematically more correct than working in terms of difficulty.

The EMAs do not do as well LWMA in my testing. One thing that makes LWMA shine is its ability to drop quickly after a hash attack. The only avenue for improving difficulty algorithms is to somehow getting one to rise as quickly as the LWMA falls without causing other problems. There are ways to do better but would require substantial changes (such as @gandrewstone's idea to allow difficulty to fall if a solvetime is taking too long and Bobtail that collects the best hashes for each block, not just one, originally for reasons other than difficulty). [ update: I went full throttle on andrewstones idea, making symmetrically and theoretically perfect with EMA. See "TSA article" ]

The EMA is better than the original LWMA but not quite as good as the new one. Jacob recommended using the EMA because it could be simplified with an e^x = 1 + x substitution in addition to not needing a loop. I opted to "advertise" the more complicated LWMA. I wanted to be sure we had the best for small coins, especially in its ability to drop after an attack. After showing them the results of a more complicated Dynamic LWMA, devs already aware of complexity problems want me it anyway (it switches to smaller N during hash attacks).

I found a section in a book that briefly described how to use the EMA as a PID controller and employed it. This is something many people have considered. But, as I expected for theoretical reasons, it could only do as well as our other algorithms, not better. But like the others, it acts in its own interesting way.

The Stampede for LWMA After Masari rapidly forked to be the first to implement LWMA in early December 2017, there was a stampede to get it to stop new catastrophic attacks that were occurring from Nicehash if not ASICs that people were not yet aware of. Zawy v1 was no longer sufficient protection as the oscillations all day were bad. Bigger coins on the default CN algo were beginning to find it as unacceptable as the smaller coins always had. Now (5 months later) there are over 30 coins who have implemented LWMA or are about to. About 1 coin per 3 days is getting it.

History of LWMA issues & improvements

  1. Masari was in a rush and therefore did not want to redo variable declarations to allow negative solvetimes. @thaer used if ST < 1 then ST =1 which allows a catastrophic exploit. After seeing this in his code, I immediately showed him @kyuupichan's (?) method of handling out-of-sequence timestamps. Two weeks later he forked to do the repair. Unfortunately, other coins had copied his initial code without us knowing it which led to catastrophic exploits with one coin losing 5000 blocks in 2 hours.

  2. Cryptonote coins sometimes kept old Cryptonote code that caused problems. One problem was a sort of the timestamps before LWMA was applied which is a milder form of the problem above. Another problem was keeping in the 15 block lag which is basically a 15 block gift to an attacker, making LWMA seem slow to respond. Cut should also be changed to zero.

  3. Several coins did not change their N. It appears they just briefly read about it and forked to implement it. One used N=17 left over from Karbo and Sumo. Another left N = 720 from CN default. Another used N=30. All of these were clearly not as good as the recommended N =~ 60.

  4. In December I told BTG about there not being any fix to > 50% miners who apply a bunch of forwarded timestamps. My +7/-7 limits in the code only stopped < 50% bad timestamps. I then realized the FTL could be lowered to stop even > 50% attacks from causing harm. I started telling LWMA coins to lower their FTL and shortly thereafter the first attacks using bad timestamps on coins who had not changed the FTL began. They all had to fork ASAP to correct it. Some have not. Coins copying others' LWMA without changing their FTL is a continuing threat.

  5. IPBC copied my recommended LWMA code which was a Karbo version with some variable declaration problems corrected by Haven. IPBC's testnet ran fine. But like most other coins, they did not know how to send it out of sequence (forwarded) timestamps in order to generate negative solvetimes. Soon after they forked, a negative solvetime came though and difficulty went to zero because the denominator was really high. A line of code attempted multiplication of a signed and unsigned integer, i.e. : double = size_t * int64_t. The high bit for sign in int64_t was treated as a really large number. This is a very basic mistake 3 of us should have caught before IPBC tripped over it and had to debug the code to identify the problem. Intense did not see this change and has to fork a 3rd time (first time they had the old Masari code). I feel responsible for half the forks and bloating people's code with multiple LWMA's. The main problem is that too many needed it too fast.

  6. A minor but ugly bug. The timestamp and cumulative difficulty vectors in CN are sized to N before they're passed to LWMA. The fix is to set difficulty_window to N+1 in the config file, then get N inside LWMA from N=difficulty_window - 1. In some LWMA versions out there originating from the initial Masari code, the LWMA is using N=59 when it looks like N=60. There's a quiet resetting of N at the top of the LWMA code. Further below that, there's an "N+1" that would have been correct if N were correct, but since it's not, it should result in a 1/60 = 1.7% error in difficulty. But it turns out that due to integer handling, there's a round off that should have caused some error, but instead it corrects the other error if N is an odd number.

  7. As discovered in the recent Graft fork, even with the FTL = 500 fix, big miners can still play havoc with CN coins due to a bug in CN. They can "own the MTP" which can result in blocking all other miners and pools from submitting blocks if the "Jagerman MTP Patch" is not applied ( @jagerman ). This is a not an LWMA problem but something that fixes a vulnerability in all DAs if a miner has > 50% power.

  8. Sometimes devs want to make improvements to the LWMA. Normally they want to make it respond faster by lowering N. Graft tried an ATAN() function. All attempts so far have lowered its performance. I'll change the LWMA page if a modification shows an improvement. A Dynamic LWMA is in the works.

  9. With the FTL limit lowered, I can remove a +7xT timestamp limit. This will allow it to drop faster after bad hash attacks without losing bad timestamp protection. I have been hesitant to remove the +7 because if a coin does not use it and does not set their FTL correctly, a catastrophic timestamp exploit is allowed even with < 30% hashpower. I could make the code do an if / then based on FTL setting to set or remove +7xT limit. I have opted to change the -7xT limit to -FTL because negative solvetimes should only occur if there was a fake forward timestamp. If there is a fake reverse timestamp, to drive difficulty up (there can be profitable reasons for this, not just malicious) then the tighter reverse limit helps.

  10. A certain theoretical malicious attack is now stopped by using MTP=11 instead of 60.

  11. Oscillations in LWMA start appearing. In March 2018 Niobio started LWMA and was showing an oscillation pattern. It soon subsided. Then BTC Candy switched from an SMA to LWMA and saw improvement, but it still had TERRIBLE oscillations. It was constantly on the verge of collapse small CN default coins, but somehow survived. Then Iridium started having moderate oscillation problems before it forked to a new POW. Then Masari forked to a new POW and started having moderately bad oscillations. The POW changes in Monero were appeared to be causing either Nicehash or Asics to start focusing on certain POWs which exposed LWMA's weaknesses: it needed to respond faster and not fall too far after an attack. Wownero and Intense started showing minor oscillations in May. About 15 other LWMA coins were still doing good. I tried a Dynamic LWMA and Slipstream EMA, both of which I later determined were likely to cause as many problems as they solved, so I abandoned them. Karbo, Technohacker, Niobio, and Wownero were already coding Slipstream and running testnets when I abandoned it. BTC Candy made a change about May 17 that is so far (only 6 days) working great for them. My review of their code indicates it may not help the smaller types of oscillations.

  12. The oscillations continued on BTC Candy and Masari into June 2018, while being minor on other coins. Looking at it closely, it had the appearance of a single big miner who loves the coin, and is willing to constantly mine it at a higher difficulty than others. The truth is that the POW seemed to make it more profitable for a smaller goup like ASICs or Nicehash than GPU miners. The problem with Masari was repeatedly clear 5 to 10 times per day: big "miner" drove it up 20% to 30% in 10 to 20 blocks, then stopped. Since other miners were not present, that block took 7xT to 20xT to solve instead of 30% longer. Seriously clear: last 3 blocks took T/10 and the 4th one took 20xT. The long delay caused a 20% to 30% drop, and the miner came back, repeating the problem. Stellite did like BTC Candy and turned my "Dynamic LWMA" idea of triggering on a sequence of fast solve times into a fast rise without a symmetric fall which I was sort of opposed to on mathematical grounds. But the worst that can happen is that average solvetime will be a little too high and that will occur only if there are constant attacks. Their main net results were better than LWMA, so I experimented with many different modifications and came up with some a lot simpler than theirs and I believe better with LWMA-2.

  13. In June 2018, four Zcash clones started looking into LWMA, so I refined BTG's code to be more easily used in those BTC-like coins. Over 50 coins have LWMA on mainnet or testnet. Several have LWMA-2 on testnet and are planning to switch to it in the next fork.

  14. In September 2018 a > 50% selfish mine attack was performed on an LWMA that was able to lower difficulty and get 4800 blocks in 5 hours. It's the result of an error I made in handling negative solvetimes that was present most of LWMAs. The attack used a method I seem to have discovered and publicized 60 days earlier, but I did not realize the attack applied to my own algorithm. It still applies to LTC, BCH, Dash, and DGW coins, unless there is a requirement outside the DA to limit out-of-sequence timestamps more strictly than the MTP, or unless the MTP is used as the most recent block in the calculation. A different method is used in LWMA-3 and LWMA-4 so developers do not need to do work outside the algorithm, and so that there is not a delay caused by using the MTP protection like Digishield does (which adds to oscillations). LWMA-3 refers to coins that have the timestamp protection, and probably contain LWMA-2's jumps.

  15. November 1. It looks like LWMA-2's 8% jumps (when sum last 3 blocks solvetimes < 0.8xT) may not have helped the NH-attacked coins. Lethean, wownero, graft, and saronite still look very jumpy and have ~10 or so blocks delayed > 7xT which normally almost never occur. Bitsum's LWMA-2 made things a lot worse because I did not realize until now that negative solvetimes make LWMA difficulty jump up a lot more. LWMA-4 has been modified to refine the LWMA-2 & 3 jumps to be more aggressive and yet a lot safer (not make things worse). It required a lot of work and makes the code a bit more complicated. Coins not needing it should stick with (fixed) LWMA. LWMA-4 also introduces using only the 3 most significant digits in the difficulty, forcing the rest to 0, for easier viewing of the number. Also, the lowest 3 digits will equal the average of the past 10 solvetimes, helping everyone to immediately see if there is a problem.

  16. About January 2019: although LWMA 2, 3, and 4 seems better on most coins than LWMA-1, when there is a persistent problem like there has been on Wownero, they seem to make it worse. Also 2, 3, and 4 may bias the performance metrics to look better than they are. For these reasons I've deprecated all versions except LWMA-1.

  17. May 16, 2019. I discovered lowering FTL to greatly reduce timestamp manipulation results in allowing a 33% Sybil attack on nodes that ruins the POW security. Many devs from many coins were involved in the FTL discussions specifically to determine if there was a reason not to lower it too much. I blame the original BTC code for not making the code follow its proof of security. If changing a constant affects the security of an unrelated section without warning, it's not good code. If your coin uses network time instead of node local time, lowering FTL < about 125% of the "revert to node time" rule (70 minutes in BCH, ZEC, & BTC) will allow a 33% Sybil attack on your nodes, So revert rule must be ~ FTL/2 instead of 70 minutes. If your coin uses network time without a revert rule (a bad design), it is subject to this attack under all conditions See: https://github.com/zcash/zcash/issues/4021 for more details.

Problems I've been a part of Other than wasting Zcash developers' time and sending Cryptonote clones down the path of Zawy v1 SMA N=17 when the default Digishield would have been better (or just lowering their N and removing the lag, which is basically what Zawy v1 was except N was too low) other problems I've been at least involved in:

1) Bitcoin Gold also suffered greatly from following my recommendation of converting their Zcash-derived Digishield to an SMA N=30. It may have worked fine if not substantially better than Digishield (with the default values but not an optimized Digishield), except there is a +16 / - 32% "POWLimit" in Digishield done in a way that is derived from BTC's 4x / 1/4 POW limit. I thought the POW limit was referring to a per-block allowable decrease in difficulty. The Digishield terminology implies the creators of the limit thought so too. It's checked and potentially applied to every block, and a 16% / -32% limit makes send, although I objected to it not being symmetrical. It would have been not nearly as bad if the limit were symmetrical. But I knew a 16% limit on D per block would rarely be enforced, so I acquiesced to keeping the limits in and not even bothering to make them symmetrical. Keep in mind BTG was trusting me to make the correct final decision and did exactly as I recommended. The result has been a constant but limited oscillation and a block release rate that is about 10% too fast. The 16% actually limits the per block increase to 0.5% to 1.5% exactly when it is needed the most. I had not tested it as it was coded, but tested my incorrect interpretation of it. It may have been OK, maybe better than Digishield, if the limits were symmetrical. THe limits are almost never reach in Digishield because it is too slow in responding to activate the limits, except in startup where it causes a 500 to 1000 block delay in reaching the correctly difficulty. The error is a much milder form of BCH's EDA's asymmetry mistake. They would have been fine if it were symmetrical, i.e. if the allowed increase were as fast as the decrease. Since BTG had reached $7B market cap, this was the biggest mistake in economic terms.

2) First error in LWMA: see item 1) in above history of LWMA. I didn't create the error, I found it pretty soon, and I got them to correct it ASAP. But I was involved. I did not actively watch to see what code they were writing. I was unable to stop other coins from copying it.

3) 2nd and 3rd errors in LWMA: again see history above, items 2) and 3). Again, it is and is not my fault for the same reasons.

4) Item 4 in LWMA history was "caused" by LWMA being a lot faster. Like the BTG problem, a faster algorithm exposed an existing ... feature ... in someone else's code. It could only be exploited by a > 50% miner, so some (like a Monero commenter) would say it's not an issue we can or should worry about. However, > 50% miners are not intentionally harming coins unless they can find a way to lower difficulty while they are mining. I found the way to address this problem: simply lower the FTL. This issue also demonstrates something: > 50% "attacks" are daily life for possibly every coin that is not one of the top 20. Zcash is the 2nd largest for its POW and sees > 200% mining increases (66% miner(s)) come online to get 20 blocks at cheap difficulty at least once every day. Problems caused by > 50% miners can and should be addresses where it can. Getting back to my fault here: an unkind perspective is that LWMA caused this. I think a more accurate description is that by using the only way possible to protect small coins (making it faster) a problem was created and repaired, at a cost of making the newest adopters of LWMA fork twice. Technical detail on this problem describing why lower N exposes the problem: difficulty can be temporarily lowered to (N-FTL)/N where FTL is in terms of a multiple of the coin's T.

5) See item 5 in LWMA history. I didn't write the code, but I approved it, and a coin trusted my recommendation.

6) Item 10 in LWMA is something I was aware of from the very beginning, but did not carefully measure the extent of the problem. I thought it was a remote possibility. Again, it is only possible with a > 50% miner. Maybe I had not realized back then that > 50% mining is actually addressable.

7) Item 11. Four devs had started testnets on one of my ideas I had strongly promised would work, but then found out it had an exploitable hole. Several devs were good-spirited to tolerate LWMA-2 constantly being refined ... after I said it was definitely finished each time.