Design and Implement Zcash Safety Mode

bitcartel commented 6 years ago

This ticket is to discuss the rationale, feasibility of and highlight any technical issues related to a safety mode patch/feature.

The intent of the safety mode is:

to be activated in an emergency
to rollback the blockchain to a specified height
to prevent transactions entering the blockchain until further notice
to allow the blockchain to continue to be mined (empty blocks with coinbase tx) until further notice

To expedite deployment, a branch will track master releases, with the safety mode patch already applied.

bitcartel commented 6 years ago

I am not the originator of the idea; documenting it here to be discussed. If the intent described above is inaccurate, please edit to clarify.

bitcartel commented 6 years ago

Some initial thoughts as to why I think this idea is problematic:

Represents a change to consensus rules and not all clients may agree to implement a safety mode.
Rolling back to a specified height could end up executing a double-spend attack on an innocent party.
To be effective against an emergency, the whole ecosystem needs to deploy and run. When the time comes, some parties may disagree to do this.
Governance problems.
1. Permissionless vs Central command
2. What's the criteria for an emergency? Bail-outs for some?

zookozcash commented 6 years ago

Represents a change to consensus rules and not all clients may agree to implement a safety mode.

I don't understand this objection. People who do not agree won't install and run the version that implements fail-safe mode.

zookozcash commented 6 years ago

Governance problems. ii. What's the criteria for an emergency? Bail-outs for some?

I don't agree that "What's the criteria for an emergency?" is a problem. But, my answer to "What's the criteria to activate fail-safe mode?" is: a. If we detect a rollback that is ≥ 10 blocks deep, b. If the Zcash Company decides for any other reason.

Examples of other reasons that could come up in category (b) include discovery of a critical bug or attack that could lead to theft or loss of funds or loss of personal safety/privacy for users. Such things are not quantifiable or objective and it would be up to the Zcash Company to decide if a given situation merited activation of Fail-Safe Mode or not.
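
As an illustration of criterion (a) only, here is a minimal sketch of how a rollback depth could be measured when a node switches from one chain to another. The function names and the representation of a chain as a list of block hashes are assumptions made for this example; they are not taken from zcashd. The threshold of 10 is the one proposed above.

```python
from typing import Sequence

ROLLBACK_ALERT_DEPTH = 10  # threshold taken from criterion (a)


def rollback_depth(old_chain: Sequence[str], new_chain: Sequence[str]) -> int:
    """Count the blocks of old_chain that are abandoned by switching to new_chain.

    Both arguments are block hashes ordered from genesis to tip
    (a simplification chosen for this sketch).
    """
    common = 0
    for old_hash, new_hash in zip(old_chain, new_chain):
        if old_hash != new_hash:
            break
        common += 1
    return len(old_chain) - common


def meets_criterion_a(old_chain, new_chain, threshold=ROLLBACK_ALERT_DEPTH) -> bool:
    """True when the reorganization discards at least `threshold` blocks."""
    return rollback_depth(old_chain, new_chain) >= threshold
```

A check like this only covers the objective trigger; category (b) stays a human decision.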

bitcartel commented 6 years ago

I don't understand this objection. People who do not agree won't install and run the version that implements fail-safe mode.

If a miner produces a block with transactions, there will be a chain split between those running software which implements fail-safe mode and those who are not.

bitcartel commented 6 years ago

Such things are not quantifiable or objective and it would be up to the Zcash Company to decide if a given situation merited activation of Fail-Safe Mode or not.

Stakeholders across the ecosystem might make an assessment and reach a different conclusion. Without some governance structure to bind stakeholders to the same decision, the resulting chain split might be worse than the problem itself.

tarrenj commented 6 years ago

A "master pause switch" reminds me of the BTC alert system. Much of the reason that was removed was to decrease centralization. I'm not sure if that's a concern here though.

If it does become implemented, I'd really like to see clear documentation specifying key management, notification procedures, and what merits activation (as quantifiable as possible).

bitcartel commented 6 years ago

@tarrenj The alert system still exists in Zcash. You can see the alerts that have been sent out here: https://github.com/zcash/zcash/wiki/specification . The alert system becomes less useful over time when there are multiple clients, some of whom may decide not to support a centralized alert service.

daira commented 6 years ago

I suggest to expunge the term "empty block" from our vocabulary. There is no such thing. There are "coinbase-only blocks", but those do contain coinbase transactions and this is relevant to security: if the vulnerability can be exploited using coinbase transactions then restricting to coinbase-only blocks would not be sufficient.

tarrenj commented 6 years ago

After a 51% double spend attack performed last night, I've seriously reconsidered my personal stance on this issue. I now see the merit this would add to smaller projects where the likelihood of an attempted attack is higher. Just like with the BTC alert system, it can always be removed later. Philosophically, I'm no longer against the inclusion of a Safety mode assuming there's clear documentation on key management, notification procedures, and activation requirements as I mentioned earlier.

if the vulnerability can be exploited using coinbase transactions then restricting to coinbase-only blocks would not be sufficient.

@daira

Completely agree. Do you think limiting it to only coinbase TXs would be sufficient, or would you rather the entire chain be stopped?

zookozcash commented 6 years ago

I don't think the concerns in https://github.com/zcash/zcash/issues/3311#issuecomment-394061915 and https://github.com/zcash/zcash/issues/3311#issuecomment-394062807 should delay or prevent the implementation of Fail-Safe Mode. I don't think the problem described therein is as bad as the problem of not having Fail-Safe Mode when we need it.

daira commented 6 years ago

@tarrenj wrote:

Do you think limiting it to only coinbase TXs would be sufficient, or would you rather the entire chain be stopped?

It's not implausible that a given vulnerability would not be exploitable by coinbase transactions, since they have no inputs. So, if the vulnerability depends on transactions with inputs, it would not be exploitable.

However, I strongly believe that our plan needs to be more flexible and should not depend on the preprepared branch doing the right thing for a given vulnerability or attack. I would guess that the fixed strategy described above would do something useful against maybe half of plausible attacks, and in almost all cases there will be some better strategy tailored to the attack that is still simple and immediately deployable.

bitcartel commented 6 years ago

"What's the criteria to activate fail-safe mode?" is: a. If we detect a rollback that is ≥ 10 blocks deep, b. If the Zcash Company decides for any other reason.

What's the criteria for ending Fail-Safe mode?

If there are a set of conditions which lead to the activation of Fail-Safe mode, and those conditions cannot be verified to have changed or have been mitigated against, when can Fail-Safe mode be disabled?

nathan-at-least commented 6 years ago

I'm going to address several issues raised in this thread as separate comments.

First, terminology: this will be called "Zcash Safemode" (unless the alert system uses "safemode").

From @bitcartel in https://github.com/zcash/zcash/issues/3311#issuecomment-393605944

Represents a change to consensus rules and not all clients may agree to implement a safety mode.

The Safemode release is like any other release we make that activates consensus rule changes: users decide to either install it or not. One very important difference is that our planned Network Upgrade releases have activation heights set months in advance, whereas a Safemode release may have an activation height that's close in the future or (if it seems technically safe ignoring any "policy / governance" issues) potentially in the past.

Rolling back to a specified height could end up executing a double-spend attack on an innocent party.

True. This is what I consider a "policy / governance" issue. Therefore we need to make our "policy guidelines" very explicit. Those are separate from the technical spec of Safe Mode, though. I'd like a ticket focused only on the technical spec and implementation. First we'll build the sledge hammer, then separately decide when it's prudent to smash which things. :-D

Edit 2018-06-05: This section was incomplete, so I just completed it. HT to @daira for noticing.

To be effective against an emergency, the whole ecosystem needs to deploy and run. When the time comes, some parties may disagree to do this.

This is the same as any other release we make that changes consensus rules. A key difference is that for our planned Network Upgrades, we can (mostly) rely on end-of-support halt (see https://github.com/zcash/zcash/issues/2927#issuecomment-390311283), whereas this will rely on a large enough portion of the network upgrading on time.

Governance problems. i. Permissionless vs Central command

The qualities of permissionlessness vs central command are the same as our standard network upgrades except for timing, and the associated discretion issues that arise from quickly made/executed decisions.

ii. What's the criteria for an emergency? Bail-outs for some?

The criteria need to be nailed down, but they are separate from the technical implementation. Those criteria follow our general security criteria: anything which may lead to loss of funds for a substantial portion of the userbase, something which would compromise privacy guarantees for a substantial population, counterfeiting, and 51% attacks are all examples which may trigger a safemode release.

nathan-at-least commented 6 years ago

In reply to https://github.com/zcash/zcash/issues/3311#issuecomment-394062807

Stakeholders across the ecosystem might make an assessment and reach a different conclusion. Without some governance structure to bind stakeholders to the same decision, the resulting chain split might be worse than the problem itself.

This is true of every consensus rule change. The only difference is timing, AFAICT.

nathan-at-least commented 6 years ago

In reply to https://github.com/zcash/zcash/issues/3311#issuecomment-394115347

A "master pause switch" reminds me of the BTC alert system. Much of the reason that was removed was to decrease centralization. I'm not sure if that's a concern here though.

If it does become implemented, I'd really like to see clear documentation specifying key management, notification procedures, and what merits activation (as quantifiable as possible).

A key difference is that this "Zcashd Safemode release" is a separate software package from our standard releases. Our standard release won't contain any code specific to this safemode behavior. There will be no automatic or built-in way for ZcashCo to trigger a user's node into adopting this behavior.

Instead, the only way for a user to implement this behavior will be to install the Safemode hotfix release if ZcashCo releases it.

Separately the alert system we inherited from Bitcoin is also still present in Zcash. While it remains there we would probably also use that in the kinds of emergencies the Safemode release is designed for.

nathan-at-least commented 6 years ago

In reply to https://github.com/zcash/zcash/issues/3311#issuecomment-394436023

What's the criteria for ending Fail-Safe mode?

There are no pre-specified criteria for ending Fail-Safe mode. The mode will end when a proper fix is designed, implemented, tested, and released, and then a sufficient population of users upgrades to that fix. That fix will most likely disable Safe-Mode at a pre-planned height.

If there are a set of conditions which lead to the activation of Fail-Safe mode, and those conditions cannot be verified to have changed or have been mitigated against, when can Fail-Safe mode be disabled?

Fail-Safe mode is disabled by users installing a new client that upgrades the consensus rules to remove Safe-Mode.

nathan-at-least commented 6 years ago

I feel like the intent of this ticket isn't clear from the initial description, even though that description looks technically correct. Let me try to describe a rationale for this patch:

A Motivating Story:

Imagine a network-wide flaw is discovered which is harming users. One example might be if a flaw similar to The Internal-H Collision ships in Sapling, or there is a network attack that allows targeting a victim to steal funds, or a maliciously crafted transaction can reveal to an attacker private information from the recipient. Or, perhaps there's a large rollback and we're not certain if this is a 51% attack or there may be a vulnerability in our PoW.

What should we do in any of those cases?

Well, we first want to protect all users. Then we'd want to see if we could mitigate the problem.

What if deciding on the proper mitigation of the problem is difficult? In every case above, a safe analysis and mitigation will be difficult to do, especially under time pressure. Every block that passes means more potential damage.

So what we want to do first is protect users, before spending the time to analyze the underlying cause, select the best mitigation, and implement and test it. Issuing an "alert" is probably the best immediate response. However, alerts do not alter consensus, and transactions may still proceed after an alert is received. There is also no guarantee of when and in what order nodes will receive alerts (and they can be maliciously dropped).

Therefore, a reasonable approach would be to release a hotfix that protects users. What should that hotfix do? Depending on the nature of the vulnerability, different hotfix behaviors may be better or worse. However, deciding on different hotfix behaviors will take time, and the more options there are, or the more complex the changes, the greater the risk that the hotfix will introduce new vulnerability.

Once we've decided on what a hotfix should do, we need to implement it, test it, then cut the release.

Then we need to contact all users and the public and say "Hey, there was this problem X, and we deliberated and decided the best behavior to protect users right now is Y, so we made a hotfix, please install it as soon as possible."

Then we need to monitor the progress of that hotfix deployment.

Motivation for Safe Mode:

With that story in mind, consider what this Zcash Safe-Mode patch provides:

The Safe-Mode behavior is chosen to be "the simplest behavior that will work in many cases". This means we can decide to release it without much deliberation. Therefore, we can save precious time when deciding what a temporary mitigation should be, since we already have a "default temporary mitigation for most problems."
Because Safe-Mode is specified ahead of time, we can code it up, review it, and test it. All of those steps save invaluable time during an emergency.
Because Safe-Mode is coded up, reviewed, and tested, and we have this specific plan, we can pre-announce to users and the public the fact that we have this contingency prepared. Now everyone will know it's an option. Instead of the public announcement above, where we have to describe why we think the (hastily designed and implemented) temporary mitigation behavior is good, we can instead say "Remember the Safe-Mode contingency plan? We're doing that now." Users will already know what it is and what to expect and their decision will be that much easier.
Because Safe Mode is defined in advance, ZcashCo can use less discretion during an emergency (because we don't need to consider more sophisticated temporary mitigation behaviors). I argue this actually reduces one of the drawbacks to centralized control: open-ended discretion. Compare these two hypothetical announcements from ZcashCo:

We believe there's an ongoing attack that can compromise our difficulty-adjustment algorithm, so we quickly got together and decided to alter the algorithm by extending the window of block difficulties for input, and also add this restriction to timestamps. Please install this as soon as possible.

-versus-

We believe there's an ongoing attack, possibly related to our difficulty adjustment algorithm, but we're not 100% certain. In any case, we've released Zcashd Safe Mode as we previously announced we might do in this kind of situation. Please install it as soon as possible.

In my opinion, we are already in the first category (along with every other cryptocurrency with any hint of centralized development). That is, during a disaster, the dev team may advocate totally arbitrary behavior changes on short notice, using the disaster as rationale for the changes.

By contrast, if we pre-commit to a well known behavior, we are reducing our discretion. Think about it, if we publish a Safe-Mode patch, and tell people what it does and that we plan to use it in an emergency, and then during an emergency we say "Actually, instead of Safe-Mode, we decided it's better to tweak the difficulty adjustment algorithm as follows: …" then something weird is happening and people should be extra suspicious. (See caveat below at the end of Hard Edge Cases.)

By this reasoning, pre-committing to the Safe-Mode behavior and release plan makes users safer, because we can react more quickly to an emergency, and it reduces our discretion somewhat.

Hard Edge Cases:

Notice that almost all difficulty surrounding releasing a Safe Mode hotfix is around deciding whether or not to do so and the activation height. (As above, I argue we already have a wider space of difficult potential decisions, and Safe-Mode helps us narrow that difficult space up front.)

Activation Height

This Safe Mode still has an important point of discretion about the activation height. An activation height which is in the past relative to the release date is potentially problematic, because it reverts transactions and therefore harms a semi-arbitrary set of users to the benefit of others. A height in the future of the release date potentially allows a known harm or attack to proceed (and may even stimulate an attack).

For example, consider we discover a vulnerability that allows counterfeiting, but we suspect no attacker has used it yet. If we choose a height in the future and say "please install Safe Mode which is set to activate tomorrow", then a suspicious attacker can immediately counterfeit. If we choose a height in the past and say "please install Safe Mode which rolls back transactions since an hour ago", the attacker has no chance to react. (Tangent: this focuses on the chain-split branch that enables Safe Mode. Potential chain-splits complicate much of this analysis.)

Another important example: we detect a long rollback. Let's say ≥ 10 blocks and that rollback is relatively recent, and that the common ancestor is at height H. Then if we set the Safe Mode activation height to H, we would be rolling back both branches of the rollback. If an attacker had already used the rollback to steal money from an exchange, this doesn't change the fact that the exchange has lost out, necessarily. But it does mean the attacker's double-spent funds are now frozen at the attacker's original attack address.

We may need to choose, as a policy, to only select heights "in the future" on the theory that doing so reduces our discretionary power. However, this would be "only" a policy (so we could violate it) and attackers may leverage that policy depending on the style of attack. Even though policies can be changed or violated, they are still useful as a signal that something unanticipated is occurring, and policy violations are suspicious.

Unmitigated Disasters

There is also the edge case of disasters that are not mitigated by the Safe Mode behavior. @daira brought up one case, where the coinbase transaction itself can leverage the attack. Any case where the generation of new blocks exacerbates the problem is also not addressed by Safe Mode.

These are problematic because the original motivation was to train users that we will always use Safe Mode and if we advocate a different temporary mitigation, something is weird. So if we say "Oh, Safe Mode doesn't help against this, but there's this other hotfix over here that does." that's messy.

I think we can largely improve this by pre-identifying the kinds of failures this Safe Mode cannot protect against, including coinbase transaction attack vectors, or some kinds of PoW or block header vectors.

Edit 2018-06-05: A few typo fixes and clarifications. HT @zookozcash

tromer commented 6 years ago

There's still the question of whether safe mode is activated by a mainnet alert or by a release that (rapid as it may be) needs to be manually installed by users.

The scenarios and considerations raised in @nathan-at-least's comment, above, call for nodes to enter safe mode rapidly and simultaneously -- for which alert-based activation is much better. If users need to manually upgrade their software, then we're probably looking at variability on the order of days, which makes the activation height tradeoffs much harsher (reverting days' worth of legit transactions vs. giving an attacker days to act undeterred).

Of course, there's the centralization issue: someone needs to hold a secret key to sign "enter safe mode at block height nnn" alerts. But someone also holds the credentials for signing and announcing releases. The only difference is whether following their decision is automatic or manual. And in an emergency situation, most users would not be in position to exercise informed judgment about whether to accept the release -- they would have to trust whoever has the release credentials. So why pay the security and consistency cost to delay the inevitable?

tromer commented 6 years ago

On a related note, what is the most nefarious attack that can be done by whoever holds the credentials for activating safe mode (whether by release or by alert)?

They can, of course, DoS the network. But I think that any fund-theft attack (e.g., retroactively freezing the chain to before they made a large payment) could be fixed, with no collateral damage, in the subsequent release that disables safe mode.

daira commented 6 years ago

@nathan-at-least wrote:

@bitcartel wrote:

To be effective against an emergency, the whole ecosystem needs to deploy and run. When the time comes, some parties may disagree to do this.

This is the same as any other release we make that changes consensus rules. [...]

It's not, actually. Network Upgrades are bilateral hard forks; this is a soft fork (contracting rule change). That's important in order to avoid a persistent chain split — vulnerable nodes will reorg to the majority chain if the fixed nodes have a majority (unless they would have to reorg more than 100 blocks in which case they will halt). That would not happen with a Network Upgrade. So let's not conflate this with upgrades.
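
To make the soft-fork argument concrete, here is a toy model of a non-upgraded node choosing between its own chain and a chain mined under the restricted rules. Everything in it is a simplification assumed for illustration: chain length stands in for accumulated work, `Block` and the function names are invented for this sketch, and the only rule modeled is "coinbase-only blocks from the activation height onward". The 100-block reorg limit is the figure quoted above.

```python
from dataclasses import dataclass
from typing import List

MAX_REORG_DEPTH = 100  # reorg limit quoted in the comment above


@dataclass(frozen=True)
class Block:
    block_id: str
    height: int
    tx_count: int  # counts the coinbase transaction too


def valid_old(chain: List[Block]) -> bool:
    """Rules run by non-upgraded nodes: every block needs at least a coinbase."""
    return all(b.tx_count >= 1 for b in chain)


def valid_restricted(chain: List[Block], activation_height: int) -> bool:
    """Contracting rule change: coinbase-only blocks from activation_height on."""
    return all(
        b.tx_count >= 1 and (b.height < activation_height or b.tx_count == 1)
        for b in chain
    )


def legacy_node_choice(current: List[Block], candidate: List[Block]) -> str:
    """Return "stay", "reorg", or "halt" for a node running the old rules."""
    if not valid_old(candidate) or len(candidate) <= len(current):
        return "stay"
    common = 0
    for mine, theirs in zip(current, candidate):
        if mine != theirs:
            break
        common += 1
    depth = len(current) - common  # blocks the node would have to disconnect
    return "halt" if depth > MAX_REORG_DEPTH else "reorg"
```

Because every chain that is valid under the restricted rules is also valid under the old rules, the old node follows it as soon as it is longer, which is the behavior a bilateral hard fork would not produce.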

daira commented 6 years ago

@nathan-at-least wrote:

A key difference is that this "Zcashd Safemode release" is a separate software package from our standard releases. Our standard release won't contain any code specific to this safemode behavior. There will be no automatic or built-in way for ZcashCo to trigger a user's node into adopting this behavior.

Debatable. As I pointed out above, because this is a soft fork, if a majority of nodes adopt it then the minority will be forced to accept it, because their nodes will either reorg to the majority chain or halt. (Note in particular that if the activation height implies a rollback, the minority are forced to accept the rollback.)

daira commented 6 years ago

"RPC safe mode" is already a thing. So this feature proposal can't be called "Safe mode" without causing confusion.

daira commented 6 years ago

@nathan-at-least wrote:

The Safe-Mode behavior is chosen to be "the simplest behavior that will work in many cases". This means we can decide to release it without much deliberation.

That is precisely what is worrying me. I really don't think we can, and I think that if we do, we'd be doing more harm than good in an alarming proportion of cases.

By contrast, if we pre-commit to a well known behavior, we are reducing our discretion. Think about it, if we publish a Safe-Mode patch, and tell people what it does and that we plan to use it in an emergency, and then during an emergency we say "Actually, instead of Safe-Mode, we decided it's better to tweak the difficulty adjustment algorithm as follows: …" then something weird is happening and people should be extra suspicious.

This in my opinion is a problem, because Safe Mode might not work for a given vulnerability.

daira commented 6 years ago

@nathan-at-least wrote:

Another important example: we detect a long rollback. Let's say ≥ 10 blocks and that rollback is relatively recent, and that the common ancestor is at height H. Then if we set the Safe Mode activation height to H, we would be rolling back both branches of the rollback. If an attacker had already used the rollback to steal money from an exchange, this doesn't change the fact that the exchange has lost out, necessarily. But it does mean the attacker's double-spent funds are now frozen at the attacker's original attack address.

Huh? The attacker's funds aren't frozen at all, they can still be re-spent. Forcing another rollback isn't useful in this case.

daira commented 6 years ago

I'm going to start referring to this as "coinbase-only mode" for the time being (that's not a good long-term name), because "safe mode" is ambiguous with the RPC mode.

@tromer wrote:

On a related note, what is the most nefarious attack that can be done by whoever holds the credentials for activating [coinbase-only] mode (whether by release or by alert)?

That kind of depends whether coinbase-only mode is able to force a rollback. If implemented by a release it necessarily can, and can also do anything else; for an alert we don't have to allow this.

nathan-at-least commented 6 years ago

I think the policy decisions are clouding our understanding of the technical specification and different people are assuming different technical specs. So let's nail down the technical specs. Here's what I propose:

1. Safe Mode is implemented as a "hot fix". Each production release (including RCs) will have a "pre-baked" Safe Mode hotfix release associated with it. This follows all of our other hotfix conventions (ex: version numbers, release pipeline, etc…). Addresses https://github.com/zcash/zcash/issues/3311#issuecomment-394544548.
2. Safe Mode is a bilateral hardfork. Addresses https://github.com/zcash/zcash/issues/3311#issuecomment-394608215 and reuses the safety mechanisms we've introduced in Overwinter.
3. Safe Mode introduces these new restrictive consensus rules:

a. A block is invalid if it contains >1 transaction. (Thus coinbase transactions are still allowed.)

b. Coinbase transactions may not include Sprout or Sapling proofs.

c. Coinbase transactions must set the transaction expiry flag to a single, specific well-known value indicating "no transaction expiry". (FIXME: What is that value?)

d. Coinbase transactions must not allow inputs, must be smaller than some practical limit, and must fit some practical but limited form. (For example, if we find that every coinbase currently in existence uses a specific subset of ScriptPubkeys, we may define Safe Mode rules to require those exact templates.)

e. (Implied by 2?) Coinbase transactions must use a new unique versioning identifier.
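
A rough sketch of how rules a through e might be checked for a single block follows. It is not zcashd code: the `Tx` and `SafeModeParams` types, their field names, and all parameter values are hypothetical. In particular the "no expiry" value from rule c is left as a caller-supplied parameter because the proposal marks it as a FIXME, and the "limited form" part of rule d (for example, script templates) is not modeled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tx:
    is_coinbase: bool
    version_id: int
    expiry_height: int
    shielded_proof_count: int  # Sprout plus Sapling proofs carried by the tx
    prevout_input_count: int   # inputs that spend earlier outputs
    size_bytes: int


@dataclass
class SafeModeParams:
    activation_height: int
    required_version_id: int  # rule e; placeholder value chosen by the caller
    no_expiry_value: int      # rule c; left open (FIXME) in the proposal
    max_coinbase_size: int    # rule d; placeholder limit


def safe_mode_violations(height: int, txs: List[Tx], p: SafeModeParams) -> List[str]:
    """Return the rule letters violated by a block; empty means it passes."""
    if height < p.activation_height:
        return []  # the restrictive rules are not active yet
    if len(txs) != 1 or not txs[0].is_coinbase:
        return ["a"]  # only the coinbase transaction is allowed
    cb, failed = txs[0], []
    if cb.shielded_proof_count != 0:
        failed.append("b")
    if cb.expiry_height != p.no_expiry_value:
        failed.append("c")
    if cb.prevout_input_count != 0 or cb.size_bytes > p.max_coinbase_size:
        failed.append("d")
    if cb.version_id != p.required_version_id:
        failed.append("e")
    return failed
```

Returning the list of violated rules, rather than a single boolean, is a choice made here so that a rejected block can be reported with the reason.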

That's my first pass at specification. The goal for rules in 3 is that we especially need to be prepared for critical vulnerabilities in the Overwinter or Sapling upgrades, since that's substantial new code, plus any other vulnerabilities that use transactions as a vector.

Complementary to this technical specification it is valuable to delineate which kinds of vulnerability this does not protect against. Here's a probably non-comprehensive start:

vulnerabilities in the PoW algorithm are not mitigated.
vulnerabilities in the difficulty adjustment algorithm are not mitigated.
vulnerabilities in the restricted form of coinbases are not mitigated.
vulnerabilities in block header processing are not mitigated.

Edit right after post: list formatting.

nathan-at-least commented 6 years ago

In reply to @tromer in https://github.com/zcash/zcash/issues/3311#issuecomment-394544548

There's still the question of whether safe mode is activated by a mainnet alert or by a release that (rapid as it may be) needs to be manually installed by users.

The spec I advocate explicitly does not introduce any design complexity into our mainnet protocol. So this isn't code that is present on normal user installs, and instead users must upgrade to activate this code. I have several motivations for this:

1. Less code complexity on mainnet is better.
2. Activation by a centrally controlled key places discretion entirely in the hands of key holders. Activation by installing an update retains discretion with users. Consider a few crucial cases:

a. We advocate a rollback and substantial stakeholders in the ecosystem consider that onerous. In this "hotfix release" model, all users' interests are protected in this manner. Consider especially if a malicious regime were to compel ZcashCo to attempt to halt transactions, or one particular harmed party is buddies with ZcashCo / Zcash devs, or we simply are unaware of a very important impact when we select a block height, or different Zcash devs disagree (ex: the Foundation and ZcashCo disagree on the path forward).

b. Keys we control can get stolen. Attackers can then economically or strategically benefit at the expense of the Zcash economy. Also, the fewer "magical keys" ZcashCo or Zcash devs in general hold, the less of a target they are.
3. The alert system already exists, and we can use it, but we want to remove it eventually (for similar reasons as the last list item).

daira commented 6 years ago

Woah, hold on.

A lot of the design decisions that went into the Network Upgrade mechanism were made based on the explicit assumption that upgrades would be released at least (EOS halt period) before the activation height.

If we use the Network Upgrade mechanism for emergencies, then several of those decisions no longer make sense. In particular, we chose to do bilateral hard forks on the basis that the main disadvantage of that kind of fork –that it always technically causes a chain split– would be mitigated by the non-upgraded chain being economically irrelevant due to almost all of its nodes having halted. That would not be the case for this emergency usage.

daira commented 6 years ago

Another important issue is that given the inability to roll back more than 100 blocks, starting to run the emergency release later than 100 blocks (~4 hours 10 minutes) after the activation height, will not work at all without reindexing. (Maybe that could be mitigated by having the emergency release always reindex on first run, which would have other advantages in case the reason for the emergency involves chain state corruption.)
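
The numbers above can be checked with a few lines. The 150-second target block spacing is an assumption inferred from "100 blocks (~4 hours 10 minutes)", and the exact boundary (whether a node exactly 100 blocks past activation can still roll back) is also assumed here rather than taken from the code.

```python
from datetime import timedelta

MAX_REORG_DEPTH = 100                          # blocks, as stated above
TARGET_BLOCK_SPACING = timedelta(seconds=150)  # assumed: 2.5 minutes per block


def reorg_window() -> timedelta:
    """Wall-clock time covered by the deepest rollback a node will perform."""
    return MAX_REORG_DEPTH * TARGET_BLOCK_SPACING


def needs_reindex(blocks_past_activation: int) -> bool:
    """True when the node is too far past the activation height to roll back."""
    return blocks_past_activation > MAX_REORG_DEPTH
```

With these assumptions `reorg_window()` is 4 hours 10 minutes, matching the estimate in the comment.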

bitcartel commented 6 years ago

Regarding the example of rolling back when a double-spend attack is detected:

With a Sprout rollback, transactions are placed back into the mempool and available for mining again.
With an Overwinter (or later) "safe mode" rollback, the combination of transaction expiry and the mining of empty blocks will result in all expiry-height based transactions not being mined and getting evicted (since fixing the issue and disabling safe mode will probably take longer than the time to create the rollback period).
- This means that rolling back can create new double-spend victims. For example, a merchant may have promptly delivered goods and services to a consumer, but the rollback will result in loss of funds when the consumer's payment transaction expires.
- Now consider if a double-spend attack were to occur on the weekend. It might take 24-48 hrs before humans at Zcash company determine the criteria to trigger safe mode has been reached. By then there might be 576-1152 blocks to roll back, and at today's rate of 10,000 tx/day, this could mean rolling back 10,000-20,000 transactions which end up expiring.
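
For reference, the weekend estimate can be reproduced as follows. The 150-second block spacing is an assumption consistent with 576 blocks per day; the default of 10,000 transactions per day is the figure quoted above, and the function name is made up for this sketch.

```python
SECONDS_PER_BLOCK = 150  # assumed target spacing (576 blocks per day)


def rollback_estimate(delay_hours: int, tx_per_day: int = 10_000):
    """Return (blocks, transactions) affected by a rollback of delay_hours."""
    blocks = delay_hours * 3600 // SECONDS_PER_BLOCK
    transactions = round(tx_per_day * delay_hours / 24)
    return blocks, transactions
```

A 24-hour delay gives 576 blocks and 10,000 transactions; a 48-hour delay gives 1152 blocks and 20,000 transactions.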

nathan-at-least commented 6 years ago

On Network Upgrade mechanism, in reply to https://github.com/zcash/zcash/issues/3311#issuecomment-394835682

@daira, are you suggesting that an emergency hotfix release in which we do not know a specific safe mitigation (perhaps because we suspect an attack vector in transactions, but we don't know what the specific vector is) should not be activated as a bilateral hardfork with explicit changes in versioning?

On reindexing, in reply to https://github.com/zcash/zcash/issues/3311#issuecomment-394837288

Great point. Let's require Safemode to reindex.

On rollbacks, in reply to https://github.com/zcash/zcash/issues/3311#issuecomment-394919805

Yes, rollbacks have substantial costs. However, are you advocating we should never roll back? What if we had a bug like https://en.bitcoin.it/wiki/Value_overflow_incident which completely destroys the network's reason for existing?

Again, notice that we're mixing some technical points (ex: we need to reindex, or we need to figure out a general purpose safe-as-possible activation mechanism) with policy concerns: rolling back too far is damaging.

The policy decision about activation height is completely separate from the technical implementation. Imagine that two days from now we discover a critical flaw that fundamentally compromises the reason for the existence of the Zcash network, and if transactions were disallowed beyond some height, then Zcash would continue to exist. Now, should we have a tool ready for that contingency, or should we just wait until that occurs, then after the disaster has begun start arguing about implementation details and appropriate policies?

Let's first make a tool that can save people's property (and potentially privacy), then after we have it, let's think about guidelines for when to use it. Let's install lifeboats in our ocean liner first and then decide when it's acceptable to abandon ship.

zookozcash commented 6 years ago

Daira pointed out that the word "safe mode" already means this related feature that Zcash inherited from Bitcoin: https://en.bitcoin.it/wiki/Alert_system#Safe_mode

There are two related ideas that we need names for. The first is the functionality of a local client returning errors and refusing to process certain RPC APIs once it has received an alert. The second is having a pre-built "hot fix" binary which changes the consensus rules to disallow transactions (except for coinbases).

We need words for these two things.

nathan-at-least commented 6 years ago

Proposed (incomplete) spec:

The new name is "Zcash Emergency Mode".
Coinbase-only mode: drop mempool txns; every time a tx arrives over p2p, drop it.
Activation is "soft fork"-style, at a well-known height.
Implemented as a config option that takes the height, so that for activation we can tweet instructions. Danger: What about spoofers tricking victims into enabling it?
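A minimal sketch of what the spec items above might look like as node logic, not zcashd code. The `Node`, `Block`, and `Tx` types and the `emergency_height` option name are hypothetical, and the sketch assumes the restriction applies to every block at or above the configured height.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Tx:
    txid: str
    is_coinbase: bool = False


@dataclass
class Block:
    height: int
    txs: List[Tx]


@dataclass
class Node:
    # Hypothetical config option; None means the mode is not configured.
    emergency_height: Optional[int] = None
    mempool: List[Tx] = field(default_factory=list)

    def mode_active(self, height: int) -> bool:
        """Assumption: the mode applies to all blocks at or above the height."""
        return self.emergency_height is not None and height >= self.emergency_height

    def accept_block(self, block: Block) -> bool:
        """Every block needs a coinbase first; in the mode, nothing else is allowed."""
        if not block.txs or not block.txs[0].is_coinbase:
            return False
        if self.mode_active(block.height):
            return len(block.txs) == 1
        return True

    def accept_to_mempool(self, tx: Tx, next_height: int) -> bool:
        """Drop any transaction arriving over p2p once the next block is restricted."""
        if self.mode_active(next_height):
            return False
        self.mempool.append(tx)
        return True

    def on_new_tip(self, tip_height: int) -> None:
        """Empty the mempool as soon as the next block would be coinbase-only."""
        if self.mode_active(tip_height + 1):
            self.mempool.clear()
```

Because the rule only narrows what a valid block may contain, nodes that never set the option would still accept the coinbase-only blocks, which is the sense in which activation is "soft fork"-style.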

[Edit 2023-09-16: We cannot use "Emergency Mode" either, because that name has been used for ECC's response to the "sandblasting" denial-of-service attack that started in June 2022. I've changed the title of the issue back to "Zcash Safety Mode", in the absence of a better name.]

tromer commented 6 years ago

Making it a config / commandline option is an excellent idea! Reduces deployment time and the need for users to check what software they've just downloaded, under time pressure.

Should maintain an up-to-date list of instructions for how to configure it in packages that hide the zcashd executable, such as WinZEC.

zebambam commented 6 years ago

So as a bit of context, if this is a system to stop attacks, they happen in minutes.

If there were a bug that is exploitable via transactions that someone spots either via some sort of intrusion detection / incident response or via financial reconciliation, we're talking about minutes of response time.

a) It would probably take hours to contact 50% of the hashrate by reaching out to people and organizations individually and having them turn this on,

b) after all of which they might still be on the consensus-losing side when the mode is engaged, putting them at a market disadvantage, possibly damaging their businesses, our reputation and market confidence in zcash along the way.

c) Years into the future, if this feature is still available, opt-in at the point of install is going to fail for sure, because in general i) the default install of software /is/ the software in 99% of cases. Almost nobody reconfigures things (source: admin/admin gets you into nearly all routers) and ii) there's a hidden third category of users after "Yes, I'm in, I love emergency mode", "No, I'm out, no emergency mode for me" and that category is "Don't know, don't care". Those people will use the default.

If we want widespread adoption of a zcash implementation that has this feature enabled, we're going to take on users who are apathetic about our design principles and just want a product that works. I'd argue that we already have such users, but certainly we have to cater for them.

If we're going to achieve effective consensus on this mode that strengthens zcash, its community and market confidence in zcash, that needs to be done at the point users install the software through our existing opt-in update system, and we ideally need to know that the mode will be effective before we engage it. It's not clear how to implement that idea without causing a hardfork, although I do have some ideas around offering that information (do you support emergency mode) to peers, which we could potentially use to determine if a zcash network supports emergency mode or not.

zebambam commented 6 years ago

I also think that any discussion of roll-back should be kept away from the design for this mode. We should not support a roll-back without implementing that as a separate feature. After some thinking about the consequences of doing one for our ecosystem, it seems like a feature that has rollback baked-in would be unpopular with almost everyone.

daira commented 6 years ago

I agree with @zebambam that this mechanism, if we add it, needs to work quickly.

I propose the following design:

There are several "triggers" that impose restrictive consensus changes. For each such trigger, for each network (testnet, mainnet, regtest), there is a "magic UTXO". Spending from that UTXO activates the trigger for blocks after the triggering block. There is no way to deactivate a trigger other than performing a network upgrade supported by a new release.

Example triggers:

TRIGGER_COINBASE_ONLY - blocks can only contain coinbase transactions;
TRIGGER_NO_SHIELDED_SPROUT - transactions cannot contain JoinSplits;
TRIGGER_NO_SHIELDED_SAPLING - transactions cannot contain Sapling Spends or Outputs;
TRIGGER_TEST_NO_EFFECT - no effect; used for live test of the trigger mechanism.

New triggers may be added at network upgrades.

The magic UTXOs could be for multisig t-addresses with keys held by the company and Foundation; that's a separate design decision.

The main advantage of this design is that it uses only "on-chain" signalling, and will not cause a chain fork regardless of how many nodes are at which version or when they upgrade.

Note that adding this mechanism is by itself a soft fork.
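The trigger mechanism described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a proposed implementation: the outpoints in `MAGIC_UTXOS` are placeholders, the `Tx` fields are simplified stand-ins for real transaction structure, and signature checks on the magic UTXO spend are left out entirely.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List, Set, Tuple


class Trigger(Enum):
    COINBASE_ONLY = auto()
    NO_SHIELDED_SPROUT = auto()
    NO_SHIELDED_SAPLING = auto()
    TEST_NO_EFFECT = auto()


Outpoint = Tuple[str, int]  # (txid, output index)

# Placeholder outpoints for one network; real values would be fixed per network.
MAGIC_UTXOS: Dict[Outpoint, Trigger] = {
    ("magic-coinbase-only", 0): Trigger.COINBASE_ONLY,
    ("magic-no-sprout", 0): Trigger.NO_SHIELDED_SPROUT,
    ("magic-no-sapling", 0): Trigger.NO_SHIELDED_SAPLING,
    ("magic-test", 0): Trigger.TEST_NO_EFFECT,
}


@dataclass
class Tx:
    inputs: List[Outpoint] = field(default_factory=list)
    is_coinbase: bool = False
    has_joinsplits: bool = False  # Sprout shielded components
    has_sapling: bool = False     # Sapling Spends or Outputs


def tx_allowed(tx: Tx, active: Set[Trigger]) -> bool:
    """Check one transaction against the currently active triggers."""
    if Trigger.COINBASE_ONLY in active and not tx.is_coinbase:
        return False
    if Trigger.NO_SHIELDED_SPROUT in active and tx.has_joinsplits:
        return False
    if Trigger.NO_SHIELDED_SAPLING in active and tx.has_sapling:
        return False
    return True


def connect_block(txs: List[Tx], active: Set[Trigger]) -> Set[Trigger]:
    """Validate a block under the triggers active before it, then return
    the trigger set that applies to the following blocks."""
    for tx in txs:
        if not tx_allowed(tx, active):
            raise ValueError("transaction violates an active trigger")
    fired = {MAGIC_UTXOS[i] for tx in txs for i in tx.inputs if i in MAGIC_UTXOS}
    # Triggers are only ever added, never removed, matching "no way to deactivate".
    return active | fired
```

One consequence visible in the sketch: once `COINBASE_ONLY` is active, no further trigger can fire on-chain, since spending a magic UTXO requires a non-coinbase transaction.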

Credit to @tromer for the "magic UTXO" idea.

zebambam commented 6 years ago

I'm not tied to alerts vs. transactions. I only fixated on alerts because of the speed of replication, but of course solved blocks are flooded also, and your solution solves the edge cases which we would have had to try to avoid by picking a blockheight we expect is in the future. Your proposal results in a faster overall engagement of Emergency Mode.

Side question: Is there merit to adding triggers to kill just transparent transactions?

daira commented 6 years ago

On the side question:

It's difficult to kill all transparent transactions, because coinbase transactions are transparent, and (even if a bunch of assumptions in the code about blocks always containing a coinbase tx were removed) what incentive would miners have to mine if not for coinbase transactions?

There might be merit in having a "TRIGGER_ONLY_COINBASE_AND_FULLY_SHIELDED" trigger, but I'm not sure what plausible vulnerabilities that would prevent. I guess coinbase transactions don't have inputs, so this does cover potential bugs that are only exploitable via transparent inputs. In any case we shouldn't have a proliferation of different triggers unless we can make a case that each one might cover a set of plausible vulnerabilities.

zebambam commented 6 years ago

Right, sorry, I meant transactions with inputs (or that's the only way my comments make sense, so I'll take it ;). It seems to me that this is a possible vector for vulns or possibly a component of an exploit, and I know we'd be kicking ourselves if we considered it and then decided not to add it.

I don't like the idea of any switch that stops mining / coinbase transactions, because I think that would do way more harm than good by interrupting the ecosystem, causing uncertainty that I think would be more disruptive to the ecosystem than just having a critical bug for a while.

Shielded transactions also have inputs, so I guess it hinges on what we think the likelihood is that a vuln would be found in transaction inputs (or that requires transparent transaction inputs to exploit) that affects only transparent and not shielded transactions. Seems like if you're trying to load data into the address space of target processes, you can do that by flooding transactions, you don't need to get something permanently into the ledger. Hmm maybe I'm talking about stopping transaction flooding for exactly that reason.

zebambam commented 6 years ago

There's probably a ton of other ways of doing that, but if we uncovered an exploit in the wild that used this method, I guess stopping the flood temporarily until we fix the underlying cause could be a useful tool.

zebambam commented 6 years ago

But then why would you use something in an exploit that you know could be turned off at the first sign of trouble, unless it was the only option?

daira commented 6 years ago

Well, if you want to short Zcash, you could launch an attack and then you win either if the attack continues or if an emergency trigger is invoked, because the price would tank either way. (That doesn't mean that emergency triggers are a bad idea; just that they don't solve that particular problem.)

zebambam commented 6 years ago

So if we restricted engaging to only situations where direct exploitation would also cause devaluation (assuming it were spotted at all) then we're just talking about magnitude of price shifts rather than causing one where there wasn't already. I mean, if you know of such a bug you can just short then publish the bug and watch the price drop. Basically, if someone knows of such a bug but doesn't disclose it to us responsibly, they can cause a price drop whether through exploitation or publishing - we're just trying to pick the path that leads to the strongest market confidence. I'd argue we should look to the longer-term confidence rather than worry about short-term reactionary drops. I think we'd want to communicate well around this too in order to mitigate misunderstandings.

mms710 commented 5 years ago

Moving note from an associated card onto this ticket: Discuss design: from Trello: "failsafe mode patch" - a branch implementing safety mode freeze (e.g. at some block height, only empty blocks with a coinbase transaction are accepted).
