How to Establish the Root of Trust in Zoom’s E2E Protocol? Reliably, Securely and Conveniently Validating the Meeting Security Code via Speech-to-Text Transcription

Meeting-Security-Code-Validation.pdf

Blum et al. recently published a white paper describing Zoom’s proposed End-to-End Encryption (E2EE) protocol and architecture [1], with a roadmap of work to be done in several phases. Perhaps the most important phase is “Phase I: Client Key Management,” in which the authors describe the key management protocol that underlies the encryption of the media content (audio/video/text). This is a leader/host-driven protocol that relies on the leader’s public key: the symmetric “meeting key” (mk) is essentially encrypted under each participant’s own public key, authenticated/signed with the leader’s key, and distributed to each participant over the broadcast signaling channel (“bulletin board” in the terminology of [1]). When a participant receives the (signed) ciphertext, it decrypts it to learn the meeting key and, by verifying the signature against the leader’s public key, makes sure that the meeting key was indeed generated by the holder of the leader’s key. It follows that simply sending the leader’s public key over the signaling channel is not sufficient: an attacker (a “man-in-the-middle,” or MITM) can insert its own public key over the insecure signaling channel, thereby completely compromising the security of the entire protocol and of E2EE. This attack, although active, is extremely easy for the adversary to perform. Since Zoom’s server controls the signaling channel, it would be a cakewalk for an adversary who has compromised this server, or for Zoom itself if it were under coercion from law enforcement, to replace the leader’s actual public key with the attacker’s own public key. Thus, it is extremely important to address this critical vulnerability. It is not an option; it is a must-have.
To counter this, it is essential to authenticate the leader’s public key. This is precisely what we call the “root of trust” in the proposed E2EE protocol: if this authentication is not done right, all security may be lost. The game will be over.
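
To make the role of this authentication step concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the meeting security code is a short decimal string derived from a SHA-256 hash of the leader's public key; the function name, the code length, and the placeholder key bytes are hypothetical and are not taken from [1].

```python
import hashlib

def security_code(leader_public_key: bytes, digits: int = 12) -> str:
    """Derive a short decimal code from a hash of the leader's public key.

    Illustrative only: the hash-to-digits mapping and the code length are
    assumptions, not the derivation specified in the white paper.
    """
    digest = hashlib.sha256(leader_public_key).digest()
    number = int.from_bytes(digest, "big") % (10 ** digits)
    return str(number).zfill(digits)

# Placeholder byte strings standing in for real public keys.
leader_key = b"leader-public-key-bytes"
attacker_key = b"attacker-public-key-bytes"  # substituted by a MITM on the signaling channel

code_host_sees = security_code(leader_key)        # computed on the leader's device
code_victim_sees = security_code(attacker_key)    # computed from the substituted key

# The substitution is caught only if the two codes are actually compared
# over a channel that the attacker cannot tamper with.
mitm_detected = code_host_sees != code_victim_sees
```

The sketch shows only why the comparison matters: a substituted key yields a different code, so the whole protocol is as strong as the step in which participants compare the codes.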

In the attached document, we review Zoom’s proposal to validate the authenticity of the leader’s public key, identify some fundamental and subtle security and usability problems with Zoom’s approach, and then introduce a new solution, the foundations of which have already been studied in our recent work (CCS 2014 and CCS 2017), to address many of these problems. We also list items of future work needed to transition this new solution into Zoom’s E2EE protocol in practice. The proposer and his research team are happy to work with Zoom’s researchers and engineers to make this transition possible. We would appreciate feedback from Zoom.

I think the point of voice-based verification is to take the client software out of the equation: a human-to-human protocol removes the reliance on a possibly compromised software component.

Your proposal removes this layer and has the client software check the result of an STT (Speech-to-Text) process against a given known text. This removes the human verification, true, but if you trust the software doing the check, well, you may as well trust the key exchange protocol and skip the speech part.

Am I - very possibly - missing something?

Hi there,

Yes, I think you are probably missing the crux :-)

The client is already trusted, since the client stores the long-term private keys and the meeting key resulting from the key exchange protocol. So, if someone compromises the device, they get the keys and the game is already over.

In any end-to-end encryption protocol (and I believe any cryptographic protocol), end points must be assumed trustworthy, uncompromised and non-malicious. Otherwise, no crypto can help secure the communication between the devices. The goal of an E2EE protocol is to retain security when anything (and likely everything) BUT the end points have been compromised.

Please check the CCCP paper -- it details the full threat model. https://sites.uab.edu/saxena/files/2019/12/ss-ccs17.pdf

Regards! -- Nitesh

Thank you for your extremely detailed reply. After a great deal of consideration, we do not plan to incorporate this into our E2EE plan. Given the challenges around finding or training a speech recognition engine (that runs locally, even on low-end devices) which can handle the thousands of languages and accents, not to mention potential challenges around adversarial inputs to the speech recognition engine, we expect that making this work sufficiently well for large-scale production deployment would be a large effort. Instead, we plan to spend that effort on Phases 2 and 3.

If you are interested in moving this proposal forward towards real-world deployment, we recommend that you talk to platforms that have a Phase 1-like security model, because those rely more heavily on meeting codes for security.

Thanks for the response. It's not at all a problem if you do not plan to incorporate the proposed design. As a researcher who has done extensive work in this problem space, I just felt it was my moral duty to point out the flaws in your design and offer tangible ways to address those flaws.

However, to be precise, I would like to clarify some of the issues you mentioned.

Speech recognition: The transcription engine to be used here is not a full-domain natural language processing algorithm. It works with an independent, likely limited-domain, dictionary of phonetically distinct words. So, languages and accents would not be a challenge. Training won't be a challenge either. Local transcription is also not a problem on low-end devices and can be done efficiently in real time.
Adversarial inputs: The user is still in the loop, verifying the voice/video of the host speaker. Any adversarial inputs or hidden voice commands can be detected relatively easily. Also, based on my group's own research on adversarial inputs, none of the academic literature that introduced these attacks has so far demonstrated that they are practical in the real world. Further, do you mean that, because of a potential and likely only theoretical threat, users should give up altogether on devices that deploy speech recognition? This has clearly not happened for the hundreds of millions of voice assistant speakers that are currently used on a routine basis in almost all households.
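
As a rough illustration of the limited-domain check described above, the following Python sketch maps a key hash onto a small word list and compares a transcript against the expected words. The 16-word list, the function names, and the hash-to-word mapping are all hypothetical; the transcript string stands in for the output of a local transcription engine, which is not implemented here.

```python
import hashlib

# Hypothetical 16-word dictionary (4 bits per word); a real deployment would
# use a larger list chosen for phonetic distinctness.
WORDS = ["apple", "bridge", "candle", "dolphin", "engine", "falcon", "guitar", "harbor",
         "island", "jungle", "kettle", "lantern", "magnet", "napkin", "orange", "pepper"]

def code_words(public_key: bytes, n_words: int = 6) -> list[str]:
    """Encode the first n_words nibbles of SHA-256(public_key) as dictionary words."""
    digest = hashlib.sha256(public_key).digest()
    words = []
    for i in range(n_words):
        byte = digest[i // 2]
        nibble = byte >> 4 if i % 2 == 0 else byte & 0x0F
        words.append(WORDS[nibble])
    return words

def transcript_matches(transcript: str, expected: list[str]) -> bool:
    """Compare the in-dictionary words of a transcript with the expected code words."""
    spoken = [w.strip(".,!?").lower() for w in transcript.split()]
    spoken = [w for w in spoken if w in WORDS]  # ignore out-of-dictionary filler words
    return spoken == expected

expected = code_words(b"leader-public-key-bytes")
ok = transcript_matches("The code is " + " ".join(expected) + ".", expected)
```

Because the recognizer only has to tell apart the words in a fixed list, the matching step reduces to a dictionary lookup and a sequence comparison rather than open-vocabulary language understanding.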

Related: You are using numeric codes in your Phase I rollout. Unfortunately, this is a very weak design in light of a simple reordering attack, in which the attacker collects voice snippets of the digits from one code being read aloud and copies/pastes those snippets to create any new code of its liking to attack a new session. See our CCS'14 paper, which demonstrates that such attacks are very feasible: https://sites.uab.edu/saxena/files/2019/12/ss-ccs14.pdf. Note: this is not a deepfake attack, but a very simple attack that is easy to perform. If you are concerned about attacks such as adversarial inputs against speech recognition, I think you should definitely be concerned about this rather simple and practical attack. Numeric encodings should NEVER be used.
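
The reordering attack can be sketched in a few lines of Python. The functions and the example codes below are hypothetical, and the fraction computed assumes target codes are uniformly random over the alphabet; the comparison with a 1024-word dictionary is an assumption added for contrast, not a figure from the CCS'14 paper.

```python
def can_splice(heard: list[str], target: list[str]) -> bool:
    """True if every symbol of the target code was spoken at least once in the heard code."""
    return set(target) <= set(heard)

def forgeable_fraction(distinct_heard: int, alphabet_size: int, code_length: int) -> float:
    """Fraction of uniformly random codes the attacker can assemble from recorded snippets."""
    return (distinct_heard / alphabet_size) ** code_length

# One numeric code read aloud exposes 7 of the 10 possible digits here.
heard_digits = list("3141592653")
splice_ok = can_splice(heard_digits, list("5926"))       # all four digits were recorded
splice_blocked = can_splice(heard_digits, list("7000"))  # '7' and '0' were never spoken

numeric_fraction = forgeable_fraction(7, 10, 4)          # digits: small alphabet
word_fraction = forgeable_fraction(6, 1024, 6)           # six words heard out of 1024
```

With only ten symbols, a single recorded code already covers most of the alphabet, which is what makes splicing practical; with a large word dictionary, one recording covers a negligible share of it.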

You mentioned the Phase II and Phase III designs, but in my opinion they actually push things away from being end-to-end secure and towards more centralized architectures, as they will start relying upon first- or third-party infrastructure. Also, certificate transparency cannot really protect users from attacks in real time.

In any case, until you have Phase I running, the issues identified in my proposal and also mentioned above would be a significant problem for real-world users who enable E2EE during this phase.

I would be happy to discuss further, if you have interest and time. Thanks!
