[Architecture]: Check: Can join() actually join a new host?

wolf-null commented 3 years ago

The problem

Straightforwardly joining Nodes to a ProcessingHost while it is executing is problematic because:

ProcessingHost is an isolated process, so changes made in MasterHost alone won't cause any changes in the real ProcessingHost (as opposed to its mirror).
Changing the master host config without the corresponding changes in ProcessingHost will break consistency between Master and ProcessingHost.

wolf-null commented 3 years ago

One can kill two birds with one stone by adding specialized control signals to the ProcHost and MasterHost classes, such as:

"Add the nodes" (of a specified type, with a specified database and, maybe, a pre-cooked input stack) to the specified host.

[ ] Solve the problem of indeterminate behavior of MasterHost at node addition:
- Add node to some host?
- Or add a mirroring node?
- Or shall the mirror node be added automatically?
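
A minimal sketch of what such an "add the nodes" control signal could look like. Everything here (AddNodesSignal, ToyHost, ToyMaster, and their fields) is a hypothetical stand-in, not the project's actual classes. The sketch assumes the master only updates its routes after the host reports which nodes it really added, which is one possible way to keep the two sides consistent.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class AddNodesSignal:
    """Hypothetical control signal: 'add these nodes to that host'."""
    host_name: str                    # destination host
    node_type: str                    # type name, not a class object
    databases: List[Dict[str, Any]]   # one database per node to create
    input_stack: List[Any] = field(default_factory=list)  # optional pre-cooked input


class ToyHost:
    """Stand-in for ProcHost: owns nodes and reacts to control signals."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.nodes: Dict[str, Dict[str, Any]] = {}

    def handle_control(self, signal: AddNodesSignal) -> List[str]:
        """Create the requested nodes and return the names actually added."""
        if signal.host_name != self.name:
            return []  # not addressed to this host
        added = []
        for db in signal.databases:
            node_name = db["name"]
            self.nodes[node_name] = {
                "type": signal.node_type,
                "data": dict(db),
                "inputs": list(signal.input_stack),
            }
            added.append(node_name)
        return added


class ToyMaster:
    """Stand-in for MasterHost: only keeps the routing table."""

    def __init__(self) -> None:
        self.routes: Dict[str, str] = {}

    def apply_ack(self, host_name: str, added: List[str]) -> None:
        """Record routes only for nodes the host confirmed."""
        for node_name in added:
            self.routes[node_name] = host_name
```

Whether the mirror node should be created in the same step (for example inside apply_ack) is exactly the open question from the checklist above; the sketch leaves it out.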

This "add the node" affects the following problems (in implementation order):

[ ] Node hierarchy (in other words: how to tell which node to create, since transmitting the class directly is not recommended because nodes don't know whether a signal will be transferred locally, between processes, or via the network). Requires adding a _node_type field to the Node's database as a mandatory property.
[ ] Encapsulating a node's connections in the Node's database as _in_connections, _in_reverse_connections, _out_connections, _out_reverse_connections. This will also require refactoring the net-building algorithms. Is it worth it?
[ ] Service (control) messages of MasterHost and ProcessHost like adding a node or making connections
[ ] Node caching mechanism OR a demand that all data in the Node's database is cacheable. Can't it be both?
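
The _node_type idea from the first item can be sketched as a registry that maps type names to classes, so that only a plain string ever travels inside a signal. The registry, the register() decorator, build_node() and SourceNode are assumptions made for illustration; only the _node_type field name comes from the list above.

```python
from typing import Any, Dict, Type

# Hypothetical registry: type name -> Node subclass.
NODE_TYPES: Dict[str, Type["Node"]] = {}


def register(name: str):
    """Class decorator that records a Node subclass under a type name."""
    def decorator(cls):
        NODE_TYPES[name] = cls
        return cls
    return decorator


class Node:
    def __init__(self, data: Dict[str, Any]) -> None:
        self._data = dict(data)


@register("source")
class SourceNode(Node):
    pass


def build_node(data: Dict[str, Any]) -> Node:
    """Create a node from its database; _node_type is mandatory."""
    cls = NODE_TYPES[data["_node_type"]]  # KeyError if missing or unknown
    return cls(data)
```

The receiving host needs the same registry contents as the sender, which holds automatically if both import the same node modules.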

wolf-null commented 3 years ago

It looks like one can create a ProcessHost with some initial nodes and then add or delete nodes.

If a ProcessHost is empty then... what?
[ ] Shall it idle at any stage? (It still has to process input data and host-control signals, and, maybe, mirroring)
[ ] Shall a process with no nodes cease to exist?
[ ] Shall it be frozen by the MasterHost?

wolf-null commented 3 years ago

join() organizes initialization of ProcessHosts and attaches nodes to them.

The following operations are included in join() at the moment:

Generate hostnames for the hosts that have none specified
Sequentially initialize each ProcessHost (class, events, buses and the process)
Add nodes to the k-th host. The ProcHost is transmitted to the process because of the way the function is passed, e.g. host.run()
Add routes to the master host for each node in the form of a dict: node_name --> prochost_name

This function can operate on local MasterHosts only. It will work unpredictably for virtual hosts (network hosts, for instance). The function can be logically decomposed into two functions:

Initializing a node host
Adding nodes via control signals

The problem is that the way of initializing different subvariants of the Host class might differ. Either way, it is recommended to create a separate process for each host (whether it is local, virtual, etc.). This is due to the universality of the synchronization interface that the multiprocessing module provides. So, actually, not much of a problem:

[ ] Initialize a node host of a selected type
[ ] Add and configure nodes via signals.
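
A rough sketch of that two-step split, using the standard multiprocessing module. The names init_host, add_nodes, host_loop, the ("add_node", ...) message tuples and the STOP sentinel are all hypothetical; the real ProcessHost has events and buses that are not modeled here.

```python
import multiprocessing as mp
from typing import Any, Dict, List

STOP = ("stop", None)  # hypothetical sentinel control message


def host_loop(name: str, control_q, reply_q) -> None:
    """Body of a node host: consume control messages until STOP arrives."""
    nodes: Dict[str, Dict[str, Any]] = {}
    while True:
        kind, payload = control_q.get()
        if kind == "stop":
            # Report which nodes this host ended up with.
            reply_q.put((name, sorted(nodes)))
            return
        if kind == "add_node":
            nodes[payload["name"]] = payload


def init_host(name: str):
    """Step 1: initialize an empty node host in its own process."""
    control_q, reply_q = mp.Queue(), mp.Queue()
    proc = mp.Process(target=host_loop, args=(name, control_q, reply_q))
    proc.start()
    return proc, control_q, reply_q


def add_nodes(control_q, routes: Dict[str, str], host_name: str,
              node_dbs: List[Dict[str, Any]]) -> None:
    """Step 2: add nodes via control messages and record master-side routes."""
    for db in node_dbs:
        control_q.put(("add_node", db))
        routes[db["name"]] = host_name
```

host_loop only needs objects with get() and put(), so it can be exercised in-process with queue.Queue; init_host should be called under an `if __name__ == "__main__":` guard on platforms that spawn rather than fork. join() could then become a thin wrapper that calls init_host once per host and add_nodes once per node group.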

Also, it is architecturally preferable to make Nodes and Hosts interact in exactly the same way: by receiving control signals and data signals, and by holding all the config in the _data field of the class in question. So all configuration of the Host (like routing) is also stored in the _data field. The problem is that some of these fields (like process handles) are not serializable or reconstructable, and cannot be cached or passed into another process in an ordinary way.

Need to think on the latter.
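
One possible direction, sketched under the assumption that runtime handles may live outside _data: keep the serializable config in _data and hold handles in a separate attribute that is dropped on pickling and rebuilt afterwards. HostState and _runtime are hypothetical names, and a threading.Lock stands in for a real process handle. This bends the "everything in _data" rule, so it is only an illustration of the trade-off.

```python
import pickle
import threading
from typing import Any, Dict


class HostState:
    """Serializable config in _data, non-serializable handles in _runtime."""

    def __init__(self, data: Dict[str, Any]) -> None:
        self._data = dict(data)                     # routing etc., must be picklable
        self._runtime = {"lock": threading.Lock()}  # stand-in for process handles

    def __getstate__(self) -> Dict[str, Any]:
        # Only the config crosses the process boundary or goes to the cache.
        return {"_data": self._data}

    def __setstate__(self, state: Dict[str, Any]) -> None:
        self._data = state["_data"]
        # Handles are recreated on the receiving side, not transferred.
        self._runtime = {"lock": threading.Lock()}
```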

This may lead us to consider nodes as hosts or hosts as nodes. But is it worth it?

One can keep the join() function as a user-friendly way to quickly build host infrastructure.

wolf-null commented 3 years ago

Signal packaging?

Once one needs to configure a ProcessHost, or to transfer dozens of signals from peer A to peer B for any other reason, one will face the problem of signal routing overload. This can slow down the whole signal routing process and, if run in async mode, crowd out other transactions.

In that case, one can propose signal packages: a series of signals from a single A to a single B, transferred as a whole.

There are two ways of packaging:

Serializing and deserializing packages inside nodes. Nodes are then responsible for understanding packages.
In fact, this can be a built-in feature of the Node base class, so that the base class deserializes a package.

Since use cases differ, it would be more flexible to let nodes choose whether to implement that feature.

At the moment, Node class doesn't implement any exec() routines.

[ ] The Node can be written so as to deserialize input data automatically when its emit() function is invoked by the host. If the developer wants to process these signals in their own way, they can override the emit() function.
- Makes deserialization transparent for the receiver (as long as one doesn't need to intervene in the process, which is also pretty easy)
- But I have no idea yet how to make serialization as transparent as deserialization would be.
[ ] One can add special functions to the Node base class and let end users (developers) decide whether to call them in their own implementations of Node.
- This may sound cool, but it also overcomplicates the Node base class and the development of user-defined Node subclasses
[ ] On the other hand, one can add deserialization tools to the signal itself and let developers call deserialization on the special serialized signal class.
- A developer is already expected to parse input signals on their own (receiving the buffer and traversing it is still done manually)
- If a developer identifies a signal as a SerializedSignal (as mentioned, traversing input signals via "subclass tests" is already standard), they can call its deserialize() method and push the result to the end of the buffer
- Adding to the end of the buffer will violate the signal arrival order. This might or might not cause trouble. Alternatively, one can process serialized signals "here and now", or recursively.
- Forces developers to do the same additional work in the exec() method of every Node inheritor.
- [ ] Check: Will these functions seriously enlarge the size of the signals? (Since the number of signals will soon get really large, this may become an important problem.)
- [ ] On the other hand, one can automate the process: a developer who wants to avoid the additional work can use NodeSerial (or something like that) as the base class for their nodes instead of Node. This "intermediate" class guarantees that all standard serialization signals will be deserialized and passed back to the buffer in the expected order. The best way to do this is in the emit() function (it avoids resource-intensive "re-buffering"). This will slow down the ProcessHost (since emit() is executed there).
- Deserialization is then transparent, but serialization isn't. One can implement signal serialization as the dual method inside the NodeSerial class, for instance by using a *msg argument instead of a single signal. Details are to be discussed below (next comment).
- If the developer wants to process these signals on their own, they can use plain Node as the base class (or NodeNonSerial: another inheritor that guarantees the absence of deserialization mechanisms).
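
The NodeSerial variant could look roughly like this. Node, Signal and the buffer attribute are simplified stand-ins for the real classes, and JSON is used only as an assumed wire format (so payloads must be JSON-serializable in this sketch).

```python
import json
from typing import Any, List


class Signal:
    def __init__(self, payload: Any) -> None:
        self.payload = payload


class SerializedSignal(Signal):
    """A package: several signals serialized into one payload."""

    @classmethod
    def pack(cls, signals: List[Signal]) -> "SerializedSignal":
        return cls(json.dumps([s.payload for s in signals]))

    def deserialize(self) -> List[Signal]:
        return [Signal(p) for p in json.loads(self.payload)]


class Node:
    """Plain node: emit() just buffers whatever the host hands over."""

    def __init__(self) -> None:
        self.buffer: List[Signal] = []

    def emit(self, signal: Signal) -> None:
        self.buffer.append(signal)


class NodeSerial(Node):
    """Unpacks packages inside emit(), so arrival order is preserved."""

    def emit(self, signal: Signal) -> None:
        if isinstance(signal, SerializedSignal):
            for inner in signal.deserialize():
                super().emit(inner)
        else:
            super().emit(signal)
```

Because unpacking happens in emit() rather than in exec(), the inner signals land in the buffer at the position of the package, not at the end, which avoids the ordering problem mentioned above. A plain Node still receives the SerializedSignal untouched.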

One can implement all of these, but this would ruin the standard: since the developer doesn't know which serialization method is passed to the node, there is an ambiguity that can only be fixed by overlapping code.

The core problem is that the sender node is not really supposed to know whether the receiving node is ready to deserialize or not.

Host signal packaging

An essentially different way of packaging is to operate at the host-to-host level, so that signal serialization is hidden from the node. If the task is to process input signals in a block, it makes no big difference whether the Host sends n signals to the node or the Node deserializes the package and processes the same number of signals.

Deserialization is doubly transparent for a Node (compared to the previously proposed solutions): there is no need to complicate the Node class or anything else. Deserialized messages arrive at the destination strictly consecutively, so they are also processed almost simultaneously.

The serialization problem remains. There are two solutions (both involve a special SerializedSignal class):

[ ] Manual serialization. By adding an additional method for sending messages (or by just building it manually inside the node logic)
[ ] Automated serialization. If a ProcessHost is going to send a series of messages from the same source to the same destination, it unites them in a package
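
The automated variant can be sketched as a pass over the host's outgoing queue that merges runs of consecutive messages sharing the same source and destination. Message, Package, pack_outgoing and unpack_incoming are hypothetical names; only consecutive messages are merged here so that the global order is never changed.

```python
from itertools import groupby
from typing import Any, List, NamedTuple, Tuple, Union


class Message(NamedTuple):
    src: str
    dst: str
    payload: Any


class Package(NamedTuple):
    src: str
    dst: str
    payloads: Tuple[Any, ...]


def pack_outgoing(queue: List[Message]) -> List[Union[Message, Package]]:
    """Sending host: merge consecutive messages with the same (src, dst)."""
    out: List[Union[Message, Package]] = []
    for (src, dst), run in groupby(queue, key=lambda m: (m.src, m.dst)):
        msgs = list(run)
        if len(msgs) == 1:
            out.append(msgs[0])  # nothing to gain from packaging
        else:
            out.append(Package(src, dst, tuple(m.payload for m in msgs)))
    return out


def unpack_incoming(items: List[Union[Message, Package]]) -> List[Message]:
    """Receiving host: expand packages before handing signals to nodes."""
    out: List[Message] = []
    for item in items:
        if isinstance(item, Package):
            out.extend(Message(item.src, item.dst, p) for p in item.payloads)
        else:
            out.append(item)
    return out
```

Nodes on both ends only ever see individual messages, which is the point of doing the packaging at the host level.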

[SOLUTION] But do we really need serialization?

The idea of serialization originated from the problem of adding multiple nodes to a remote host. This requires the transmission of lots of data messages (imagine transferring a large number of nodes).

But, for this particular case, why not transfer the whole database instead of transferring dozens of data signals?

Well, actually, this is the solution. One can push a big database update:

[ ] Transferring the whole _data with overwriting. Just a SupersetSignal (like the SetSignal)

This decreases the number of messages sent down to the number of nodes spawned.
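
A small sketch of the difference, assuming SetSignal updates a single field while SupersetSignal replaces the whole _data at once. The field layout of both signals and the apply() method are guesses made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class SetSignal:
    """Assumed shape: set one field of the database."""
    key: str
    value: Any


@dataclass
class SupersetSignal:
    """Replace the whole database in a single message."""
    data: Dict[str, Any]


class Node:
    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def apply(self, signal) -> None:
        if isinstance(signal, SupersetSignal):
            self._data = dict(signal.data)  # overwrite, drop stale keys
        elif isinstance(signal, SetSignal):
            self._data[signal.key] = signal.value
```

Configuring a freshly spawned node then takes one SupersetSignal instead of one SetSignal per field.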

Maybe this is it...

wolf-null / resource-network-sim-v2