webmachinelearning / webnn

🧠 Web Neural Network API
https://www.w3.org/TR/webnn/

General Security Questions #241

Closed: quidity closed this issue 2 years ago

quidity commented 2 years ago

I'm a security reviewer from Google Chrome and have some general questions - they almost certainly have good answers and it's very possible I missed finding them in the spec. I also have some specific suggestions which I will open as distinct issues shortly.

  1. I think I disagree with the answer given in 2.9 in https://github.com/webmachinelearning/webnn/blob/main/security-privacy.md. This is introducing a new scripting language, built up in JS then executed in a number of different interpreters (“cpu”, “gpu” etc.). The complexity of the operations is such that the likelihood of errors that ultimately result in out-of-bounds accesses is high, and malicious sites will have significant control over operations. This could be addressed more directly in the security/privacy explainer (see the sketch after this list).
  2. Operations such as split/slice/squeeze that change the shape of tensors mid-calculation may lead to incorrect assumptions in later operations - for instance, if bounds checks are elided this could lead to out-of-bounds accesses. It would be good for there to be operation-level metadata that might be consumed by implementers to help prevent such problems.
  3. The universe of operations is likely to vary in future - how will consumers discover which operations are available (short of enumerating them through failures to instantiate)? How will operations be deprecated (for instance if they turn out to be badly implemented)?
  4. It feels like .build and .compute should be asynchronous in all cases?
  5. New side channels will be made available from shared resources (CPU/GPU). Timeable things should be out of process so they incur at least some IPC to achieve anything. Probably not a massive worry when compared with already sharing a CPU between processes running renderers.
  6. Verify: Sites must delegate permission to host/run models.
  7. Verify: No serialization or caching yet - although this is likely in future.
  8. Control over how a model is run (selecting CPU/GPU/TPU, say) - is this too much power for the consuming site? It will, for instance, make it possible to more directly target a flawed implementation. It's not clear why this is required.
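For concreteness, a minimal sketch of the kind of site-built graph "program" points 1 and 2 refer to, assuming method names and options from WebNN spec drafts of the time (the exact API shape has changed between drafts):

```js
// Rough sketch only; method names, option keys and the sync/async split vary across drafts.
const context = await navigator.ml.createContext({ deviceType: 'gpu' }); // a hint (see point 8)
const builder = new MLGraphBuilder(context);

const input = builder.input('input', { dataType: 'float32', dimensions: [1, 4, 4] });

// A shape-changing op: later ops see the sliced shape, so an implementation that elides
// bounds checks based on stale shape assumptions risks out-of-bounds accesses (point 2).
const sliced = builder.slice(input, /* starts */ [0, 1, 1], /* sizes */ [1, 2, 2]);
const output = builder.relu(sliced);

// The whole graph, shapes and operand data included, is assembled by site script and then
// executed by a native backend (the "program under attacker control" concern in point 1).
const graph = await builder.build({ output });
const results = await context.compute(
  graph,
  { input: new Float32Array(16) },
  { output: new Float32Array(4) }
);
```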


anssiko commented 2 years ago

@quidity thank you for this security review, much appreciated. We've tentatively scheduled this topic for discussion at our next meeting on 13 January 2022, 15:00-16:00 UTC. I acknowledge this may not be a reasonable meeting time for you, so we'll follow up in this issue afterwards.

anssiko commented 2 years ago

We discussed these security review questions on WebML WG Teleconference – 27 January 2022: https://www.w3.org/2022/01/27-webmachinelearning-minutes.html#t01

The WG will follow up with clarifying questions and proposed changes. This security review is tracked as part of wide review https://github.com/webmachinelearning/webnn/issues/239 of the WebNN spec.

On behalf of the group, I want to thank @quidity and Chrome Security team for this security review!

anssiko commented 2 years ago

@quidity, as shared earlier, the group discussed your Security Review feedback and here's a summary of the group's thinking and proposed next steps. If you think we're heading in the right direction, we'd be happy to know that, and even more so if our proposed next steps need course correction!

The group also had some questions to better understand some parts of the feedback, highlighted in bold below.

Thanks for your review once again!

  1. I think I disagree with the answer given in 2.9 in https://github.com/webmachinelearning/webnn/blob/main/security-privacy.md. This is introducing a new scripting language, built up in JS then executed in a number of different interpreters (“cpu”, “gpu” etc.). The complexity of the operations is such that the likelihood of errors that ultimately result in out-of-bounds accesses is high, and malicious sites will have significant control over operations. This could be addressed more directly in the security/privacy explainer.

The group agreed with the out-of-bounds access concern and proposed to address it in the test suite and also to note OOB access in the Security Considerations.
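As one sketch of the kind of test-suite coverage meant here (a hypothetical WPT case; exact exception types and builder signatures are assumed from spec drafts of the time):

```js
// Hypothetical web-platform-test: out-of-bounds slice parameters must be rejected while the
// graph is being built, before any backend kernel could touch memory.
promise_test(async () => {
  const context = await navigator.ml.createContext();
  const builder = new MLGraphBuilder(context);
  const input = builder.input('x', { dataType: 'float32', dimensions: [2, 2] });
  // sizes exceed the input extent on axis 1, so the builder should refuse to create the op.
  assert_throws_js(TypeError, () => builder.slice(input, [0, 1], [2, 4]));
}, 'slice with out-of-bounds sizes is rejected at build time');
```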

The group did not agree with the notion that the API introduces a new scripting language. What aspects of the API are of concern?

  2. Operations such as split/slice/squeeze that change the shape of tensors mid-calculation may lead to incorrect assumptions in later operations - for instance, if bounds checks are elided this could lead to out-of-bounds accesses. It would be good for there to be operation-level metadata that might be consumed by implementers to help prevent such problems.

The group will review each operator to identify OOB access issues and add appropriate guidance on mitigations for implementers.

The group was not sure what exactly "operation-level metadata" means in this context.
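To illustrate the kind of per-operator guidance being considered, a sketch of validating slice parameters against the input shape before any buffer is touched (validateSlice is a hypothetical helper, not part of the spec):

```js
// Hypothetical helper: check slice starts/sizes against the input shape up front, so later
// steps can rely on the computed output shape and never index out of bounds.
function validateSlice(inputShape, starts, sizes) {
  if (starts.length !== inputShape.length || sizes.length !== inputShape.length) {
    throw new TypeError('starts and sizes must match the input rank');
  }
  for (let i = 0; i < inputShape.length; ++i) {
    if (!Number.isInteger(starts[i]) || !Number.isInteger(sizes[i]) ||
        starts[i] < 0 || sizes[i] < 1 || starts[i] + sizes[i] > inputShape[i]) {
      throw new RangeError(`slice out of bounds on axis ${i}`);
    }
  }
  return sizes.slice(); // the output shape, fully determined before execution
}
```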

  3. The universe of operations is likely to vary in future - how will consumers discover which operations are available (short of enumerating them through failures to instantiate)? How will operations be deprecated (for instance if they turn out to be badly implemented)?

The group proposed to document such design principles in the Security Considerations.

To make possible future deprecation easier, the high-level functions can be decomposed into smaller operations, as described in the explainer. That is, potentially problematic ops could be deprecated and replaced with a polyfill implementation (likely trading some performance for security).
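For illustration, a sketch of that decomposition idea using gemm, assuming builder method names (matmul, mul, add, constant) from spec drafts of the time:

```js
// Sketch of polyfilling a higher-level op from lower-level ones, per the explainer's
// decomposition idea: gemm computes alpha * (A x B) + beta * C.
function gemmPolyfill(builder, a, b, { c, alpha = 1.0, beta = 1.0 } = {}) {
  let result = builder.matmul(a, b);
  if (alpha !== 1.0) {
    result = builder.mul(result, builder.constant(alpha));
  }
  if (c !== undefined) {
    const scaledC = beta !== 1.0 ? builder.mul(c, builder.constant(beta)) : c;
    result = builder.add(result, scaledC);
  }
  return result;
}
```

If a built-in op were ever deprecated for security reasons, a library could ship something along these lines in its place, trading some performance for a simpler trusted surface.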

  4. It feels like .build and .compute should be asynchronous in all cases?

This design consideration is being discussed in https://github.com/webmachinelearning/webnn/issues/230 and your feedback is being taken into account there.

  5. New side channels will be made available from shared resources (CPU/GPU). Timeable things should be out of process so they incur at least some IPC to achieve anything. Probably not a massive worry when compared with already sharing a CPU between processes running renderers.

The group agrees this is an area worth clarifying; any relevant guidance to implementers will be documented in the Security Considerations.

The group would like to understand how running timeable things out of process works as a mitigation.

  6. Verify: Sites must delegate permission to host/run models.

Correct. The Permissions Policy integration in place means the top-level document must delegate permission for an iframe to run a model using this API.

  7. Verify: No serialization or caching yet - although this is likely in future.

Correct. No such API affordances are exposed; this is out of scope for the spec. That said, the group is investigating whether implementation-level caching warrants guidance to implementers.

  8. Control over how a model is run (selecting CPU/GPU/TPU, say) - is this too much power for the consuming site? It will, for instance, make it possible to more directly target a flawed implementation. It's not clear why this is required.

Device selection is a hint only (the final choice is left to the implementation) and there is no device enumeration API. This partially mitigates the concern.
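For context, this is roughly what the hint looks like at the API level (the option name has varied between drafts, e.g. devicePreference vs deviceType, and implementations are free to ignore it):

```js
// A preference only: the implementation makes the final device choice, and there is no API
// for enumerating devices or reading back which one actually executed the graph.
const context = await navigator.ml.createContext({ deviceType: 'gpu' });
```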

Sub issues:

  • #242

  • #243

  • #244

Please find the group's comments in these sub issues.

anssiko commented 2 years ago

I started a draft PR to incorporate this feedback into the Security Considerations of the specification: https://github.com/webmachinelearning/webnn/pull/251

This PR, once reviewed and merged, will help tick the first box in the Security wide review we're tracking as part of https://github.com/webmachinelearning/webnn/issues/239

To tick the second Security wide review box, we'll explicitly request an "official" (used to be Web Security IG) Security wide review to get even more eyeballs scrutinising the security aspects of the spec.

anssiko commented 2 years ago

@quidity, we'd appreciate it if you could help answer the questions the group had with respect to your review, highlighted in bold in https://github.com/webmachinelearning/webnn/issues/241#issuecomment-1025952050 and also copied below for your convenience:

The group did not agree with the notion that the API introduces a new scripting language. What aspects of the API are of concern?

The group was not sure what exactly "operation-level metadata" means in this context.

(Would addressing https://github.com/webmachinelearning/webnn/issues/243 satisfy all "operation-level metadata" requirements?)

The group would like to understand how running timeable things out of process works as a mitigation.

Thank you for your help in hardening the security of this API!

quidity commented 2 years ago

The group did not agree with the notion that the API introduces a new scripting language. What aspects of the API are of concern?

My security concern is that the network is compiled into a program that is entirely under an attacker's control. While not as powerful as JavaScript, it is likely powerful enough to make some exploits easier as a result. This argues for careful implementation, and for avoiding introducing too much control over control flow on the API side. Hopefully these concerns can be considered as new operations are introduced.

The group was not sure what exactly "operation-level metadata" means in this context.

(Would addressing https://github.com/webmachinelearning/webnn/issues/243 satisfy all "operation-level metadata" requirements?)

Yes, that would be a helpful step.

The group would like to understand how running timeable things out of process works as a mitigation.

This is a concern of lower importance - mainly, adding an IPC step will add noise to any accidental high-precision timers. Realistically there is not a lot to be done here!

Thank you for your help in hardening the security of this API!

A pleasure, thanks for the detailed response.

anssiko commented 2 years ago

PR https://github.com/webmachinelearning/webnn/pull/251 has been updated to address the following:

The remaining work from this issue #241 is to decide how to address:

Our last discussion https://www.w3.org/2022/02/24-webmachinelearning-minutes.html#t06 on that topic was inconclusive. I propose we track #243 separately and mark this "General Security Questions" issue resolved when PR #251 is reviewed and landed.

Thanks for your contributions everyone. We're making strong progress with this important security review topic.

@quidity feel free to chime in on #243 to help us move in a direction aligned with your expectations.

anssiko commented 2 years ago

I'm reopening this auto-closed issue to circle back to @quidity to confirm he's happy with the general direction the Security Considerations are taking: https://www.w3.org/TR/webnn/#security (as of writing, the gh-pages deploy is not yet complete and the hosted doc is not yet updated, but that should be fixed soon)

Our next step is to request the security part of the wide review tracked in https://github.com/webmachinelearning/webnn/issues/239

quidity commented 2 years ago

Thanks - this extra thinking is helpful! I don't think there's more to do here.

anssiko commented 2 years ago

@quidity thanks for the confirmation!

anssiko commented 2 years ago

Our next step is to request the security part of the wide review tracked in https://github.com/webmachinelearning/webnn/issues/239

FYI: Submitted the security review request (part of wide review) https://github.com/w3c/security-request/issues/22 and updated https://github.com/webmachinelearning/webnn/issues/239 accordingly.