lthibault opened 2 years ago
Another consideration is security. Presently, `anchor.Capability` is exported directly at the `vat.Network` level, in its own stream handler. This means that it can be arbitrarily bootstrapped by any peer capable of dialing the vat's underlying libp2p `Host`, which in turn makes it trivial to escape confinement to a particular anchor subtree.
Obviously this isn't a problem until we actually implement authentication for client capabilities, but it does influence the design decisions for #16. As noted in that issue, per-capability stream handlers may improve performance by taking full advantage of non-blocking QUIC streams. On the other hand, they almost inevitably increase the size of the auth boundary, since each libp2p protocol endpoint must be guarded individually.¹ For this reason, I am increasingly opposed to this approach.
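The widening of the auth boundary can be pictured with a toy dispatcher (a self-contained sketch only; the protocol IDs and the `guard` helper are assumptions, not the real libp2p API):

```go
package main

import "fmt"

// handler is a toy stand-in for a libp2p stream handler.
type handler func(peer string) string

// guard wraps a handler with an authentication check. With per-capability
// stream handlers, EVERY registered endpoint needs such a wrapper;
// forgetting one leaves that capability bootstrappable by any peer that
// can dial the host.
func guard(authed map[string]bool, h handler) handler {
	return func(peer string) string {
		if !authed[peer] {
			return "denied"
		}
		return h(peer)
	}
}

func main() {
	authed := map[string]bool{"alice": true}

	// One endpoint per capability => one guard per endpoint.
	endpoints := map[string]handler{
		"/ww/anchor": guard(authed, func(p string) string { return "anchor for " + p }),
		"/ww/pubsub": guard(authed, func(p string) string { return "pubsub for " + p }),
	}

	fmt.Println(endpoints["/ww/anchor"]("alice"))   // authorized
	fmt.Println(endpoints["/ww/anchor"]("mallory")) // denied
}
```

A single bootstrap capability, by contrast, needs exactly one such guard, after which confinement is enforced by the object graph itself.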
To recap, the "pros" for streaming anchor capabilities directly from `View` are now:

The "cons" are now:

- The current host-anchor implementation introduces a fair bit of complexity that could (in time) be handled by capnp's 3PH.
@aratz-lasa In the short-term, the proposed changes have moderate but direct impact on downstream consumers, so I want to be sure we take the time to discuss this. In particular, I want to make sure we aren't going to undermine any of our existing projects (or that we have good workarounds until 3PH lands).
Problem
Consider the following code.
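A minimal, self-contained sketch of the pattern under discussion (hypothetical: `Host`, the load field, and `doFoo` are stand-ins, not the real API):

```go
package main

import "fmt"

// Host is a stand-in for a record in the cluster's global view; in the
// real API it would be produced by iterating the View capability.
type Host struct {
	ID   string
	Load float64 // hypothetical selection criterion
}

// doFoo is a placeholder for an arbitrary per-host operation.
func doFoo(h Host) string {
	return "foo(" + h.ID + ")"
}

func main() {
	view := []Host{{"h1", 0.2}, {"h2", 0.9}, {"h3", 0.1}}

	// We call doFoo on only SOME hosts (here: the lightly-loaded ones),
	// so an Anchor capability is never needed for the rest (h2).
	for _, h := range view {
		if h.Load < 0.5 {
			fmt.Println(doFoo(h))
		}
	}
}
```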
We can ignore the details of how we are selecting hosts, and of what we are doing with them. The essential part is that we perform `doFoo` on only some of the hosts in the cluster. Since we are not performing operations on every host, we do not need to create an `Anchor` capability for each one. This creates an opportunity for significant optimization, because creating an `Anchor` capability involves two sub-operations that become costly at scale: dialing a network connection to the remote host (O(1) time per anchor), and adding entries to the cap table (O(n) space).

Current Solution and Limitations
Rather than stream `Anchor` capabilities back to the client, we stream records that contain routing information for each host in the global view. We then use this routing information to construct a special `Anchor` implementation that lazily dials its remote host when its methods are called for the first time. This is a "perfect" optimization, since it avoids both sub-operations unless the anchor capability is actually used: it neither creates surplus network connections, nor modifies cap tables needlessly.

The main drawback is that this "lazy-dial" approach involves extra state management: it requires us to keep track of each `rpc.Conn` after it has been established, and to manage its lifecycle. Overall, this approach comes at the cost of extra state management, as well as a mild blurring of system boundaries. It also has the unfortunate effect of placing this extra complexity in high-level code (`pkg/client` rather than, say, `pkg/vat/ocap`).

Solution
If we are willing to tolerate additional load on the cap table, I think it is possible both to preserve the lazy-dialing optimization and to simplify the implementation.

To do so, the host servicing the `View` RPC call need only associate an `Anchor` capability with each record streamed back to the caller. As before, superfluous network connections are avoided by lazy dialing. Dialing logic is simplified by delegating the construction and lifecycle-management of remote host-anchor capabilities to the `cluster.Host` type, which sits at a lower level than `client.Node`, closer to the underlying `rpc.Conn`.

Caveats and Mitigation
Cap-Table Contention
As noted above, the proposed solution adds entries to both the sender's and the receiver's cap tables for each record transmitted by a call to `Host.View().Iter()`. In the worst-case analysis, this exhibits O(n) complexity in both overall memory usage and heap-object count. Note that the cap tables at both ends of an `rpc.Conn` are affected. This problem is, however, attenuated by several factors, ranging from the batch-wise release of unused capabilities, to go-capnp's use of `sync.Pool`, to specialized data structures in the `rpc.Conn` cap table.

To the first point, we can expect the size of the cap table to stabilize on some asymptotic value for large routing tables, as unused capabilities from previous batches are released. The exact value of this asymptote is likely a simple function of batch size and network RTT. An arbitrary upper bound can therefore be enforced through go-capnp's existing flow-control API.
More generally, full table scans are inherently O(n), so it's expected that applications will try to avoid this by filtering the view on the server-side, in a manner analogous to classical DB queries. To this end, our first line of defense is the enriched query API proposed in #36.
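Server-side filtering might look like the following (the query shape is hypothetical; the real API is the subject of #36):

```go
package main

import "fmt"

// Record is a routing record in the host's view.
type Record struct {
	ID   string
	Tags map[string]bool
}

// query is a hypothetical server-side predicate, analogous to a WHERE clause.
type query func(Record) bool

// view streams only the records matching q, so neither the wire nor the
// client's cap table ever sees entries for filtered-out hosts.
func view(records []Record, q query) []Record {
	var out []Record
	for _, r := range records {
		if q(r) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	records := []Record{
		{"h1", map[string]bool{"gpu": true}},
		{"h2", map[string]bool{}},
		{"h3", map[string]bool{"gpu": true}},
	}
	// Only the matching hosts are streamed back.
	for _, r := range view(records, func(r Record) bool { return r.Tags["gpu"] }) {
		fmt.Println(r.ID)
	}
}
```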
It should lastly be noted that `rpc.Conn` is undergoing heavy development, and that opportunities for improving performance (e.g. through reduced lock contention) are almost certain to emerge.

Object Proxying and Third-Party Handoff
An important side-effect of the proposed refactoring is that all calls to anchors obtained via the `View` capability will be proxied through its host. In practice, this means proxying through the host to which a given `client.Node` is connected.

This is a perfect target for Cap'n Proto's "Third-Party Handoff" (3PH), which can transparently reduce the network path to a single hop. Level-3 RPC support in go-capnproto is planned, and implementation efforts are estimated to begin in Q1 of 2023.
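The proxying side-effect can be sketched as a wrapper that forwards every call through the connected host (illustrative only; `Anchor` here is a stand-in interface, not the real capability):

```go
package main

import "fmt"

// Anchor is a stand-in for the anchor capability's interface.
type Anchor interface {
	Ls() string
}

type remoteAnchor struct{ host string }

func (a remoteAnchor) Ls() string { return "ls@" + a.host }

// proxiedAnchor models an anchor obtained via View: until 3PH lands, every
// call takes an extra hop through the proxy host, which is also a single
// point of failure for all such anchors.
type proxiedAnchor struct {
	via    string // the host the client.Node is connected to
	target Anchor
	hops   *int
}

func (p proxiedAnchor) Ls() string {
	*p.hops++ // hop to the proxy host...
	*p.hops++ // ...then to the target host
	return p.target.Ls()
}

func main() {
	hops := 0
	a := proxiedAnchor{via: "gateway", target: remoteAnchor{host: "h7"}, hops: &hops}
	fmt.Println(a.Ls(), "hops:", hops) // two hops instead of one
}
```

With 3PH, the proxy would instead introduce the client directly to `h7`, after which `proxiedAnchor` drops out of the call path entirely.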
In the meantime, the main factor to consider is that the proposed solution implies a commitment to 3PH in the medium-term future. The acute need for 3PH will manifest as application-level stability issues due to a single point-of-failure, and to a lesser extent as high latency due to the proxying of RPC calls.