Some notes from today's meeting:
does it require mediation layer: I think so
does it require SSO: I think so
does it require sysplex? no, there seem to be 2 technologies that help
scalability: discuss ways to allow scaling alongside HA, but note that they aren't the same thing; HA is probably more important for Zowe's use case (hundreds of users per instance, not millions)
database as state-holder: does this solve problems or create them?
can websocket be HA: it depends. use socket.io or reconnecting websocket and store state client-side whenever possible.
does it require scripting/instance updates? yes - there is currently no way to start multiple servers per instance. ports are an issue. knowing how many are started is an issue. how the logs are formatted is an issue.
do we have resources to test? unsure. can Marist or River handle the extra load of multiple servers? it sounds like they are not configured for sysplex, so that is an issue.
certificates: how do we handle keystores... does it change at all?
metrics: what numbers do we aim for with regards to
logs: same file, separate files, formatting within file?
upgrades: there is an idea to update the instance and then shut down and restart the servers in the instance one by one, so that there is actually no downtime. this sounds out of scope for this PI though.
does 3rd party software get HA for free when Zowe gets HA: sometimes. yes in 3 cases:
tiers of HA:
tier 1, cold-backup: parent agent that restarts children when they stop. there is off-the-shelf technology for this already (a minimal supervisor sketch follows these notes).
hot-backup: this is where a lot of the questions are coming from, but cold-backup could be done first to make some people happy.
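To make the tier-1 cold-backup idea concrete, here is a minimal sketch of a parent agent that restarts a child server whenever it exits. Everything in it (command line, back-off delay) is a placeholder; an off-the-shelf supervisor or z/OS automation would do this more robustly.

```java
import java.io.IOException;

// Minimal cold-backup supervisor sketch: run the supervised server command and
// restart it whenever it stops. Arguments and the back-off delay are illustrative.
public class RestartingParent {
    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length == 0) {
            System.err.println("usage: RestartingParent <command> [args...]");
            return;
        }
        ProcessBuilder child = new ProcessBuilder(args).inheritIO();
        while (true) {
            Process server = child.start();
            int exitCode = server.waitFor();      // block until the child stops
            System.err.println("child exited with code " + exitCode + ", restarting");
            Thread.sleep(2_000);                  // simple back-off before the restart
        }
    }
}
```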
Some thoughts I had around the meeting:
The service is … the Zowe servers and attendant processes on z/OS.
The key to implementing a high availability system is to identify and eliminate single points of failure through redundancies, clustering, and failover mechanisms.
Thanks for these notes, @1000TurquoisePogs @John-A-Davies; they are especially helpful for those of us, like myself, who know very little about what HA means.
The following article expresses our vision for a highly available API Mediation Layer (Apiml) and Zowe.
We had a highly available distribution of Apiml before it became a Zowe component, and when we think about the future Apiml HA story we draw on those past experiences and designs. We look forward to integrating our vision with the broader Zowe forum.
The client consuming Zowe services.
To provide clients with reliable service that is resilient to individual service failure. Quoting John Davies:
The key to implementing a high availability system is to identify and eliminate single points of failure through redundancies, clustering, and failover mechanisms.
The client has a single address and port to call: the address of the load balancer. In our case the load balancing was done by DVIPA, but other solutions can be employed; an L4 load balancer such as DVIPA, Linux IPVS, or a hardware appliance could sit in this position. Health checking of the Gateway instances is desirable, and the Gateway exposes a REST endpoint for this purpose.
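As an illustration, here is a minimal sketch of the kind of health endpoint an L4 balancer could poll. The real Gateway exposes its health via Spring Boot Actuator, so the controller, path, and payload below are assumptions made for the example rather than the actual endpoint.

```java
import java.util.Map;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

// Illustrative health endpoint for an L4 load balancer to poll; the real Gateway
// health check is provided by Spring Boot Actuator and may use a different path.
@RestController
public class HealthController {

    @GetMapping("/application/health")
    public Map<String, String> health() {
        // A real check would verify downstream dependencies (discovery, z/OSMF, ...).
        return Map.of("status", "UP");
    }
}
```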
The Gateway Service is the de facto L7 load balancer. It knows about registered services and load-balances requests across them in round-robin fashion, with a basic failover mechanism that retries a request when a service instance doesn't respond. At this time the Gateway is stateless. These behaviors can be customized or extended, but such extensions should be carefully considered; for instance, adding "sticky sessions" might disable failover and introduce state into the cluster.
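To make the round-robin-plus-retry behavior concrete, here is a small sketch of the idea. It is not the Gateway's actual routing code (which comes from its Spring Cloud stack); it only illustrates stateless round-robin selection with failover to the next instance when a call fails.

```java
import java.net.URI;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Illustrative round-robin router with basic failover: if the chosen instance
// fails, the request is retried against the remaining instances.
class RoundRobinRouter {
    private final AtomicInteger next = new AtomicInteger();

    <T> T route(List<URI> instances, Function<URI, T> call) {
        RuntimeException lastFailure = null;
        for (int attempt = 0; attempt < instances.size(); attempt++) {
            URI target = instances.get(Math.floorMod(next.getAndIncrement(), instances.size()));
            try {
                return call.apply(target);        // forward the request to this instance
            } catch (RuntimeException e) {
                lastFailure = e;                  // instance unresponsive, try the next one
            }
        }
        throw new IllegalStateException("all instances failed", lastFailure);
    }
}
```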
Discovery Service instances are clustered together and replicate information about registered services among themselves. To do this they need to know about each other; this is where the cluster is wired together. That implies the cluster is either defined at install time (static: z/OS install) or we come up with a way to distribute the configuration dynamically (containers).
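Assuming the Discovery Service keeps its Spring Cloud Netflix Eureka base, clustering comes down to each Eureka server pointing at its peers. The class and configuration below (hostnames, port, file layout) are illustrative assumptions, not Zowe's actual defaults.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.netflix.eureka.server.EnableEurekaServer;

// Illustrative two-node Eureka discovery cluster. Registrations received by one
// instance are replicated to the peer listed in its serviceUrl configuration.
@SpringBootApplication
@EnableEurekaServer
public class DiscoveryApplication {
    public static void main(String[] args) {
        SpringApplication.run(DiscoveryApplication.class, args);
    }
}

// application.yml on discovery-1 (discovery-2 mirrors it, pointing back at discovery-1):
//
//   eureka:
//     client:
//       serviceUrl:
//         defaultZone: https://discovery-2:10011/eureka/
```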
Individual services only need to know about and register with a single Discovery instance. High availability of a service is achieved automatically once more than one instance of it is registered.
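From the service's point of view, this is all it takes: register with one discovery instance, then run a second copy. Below is a generic Spring Cloud sketch of that registration; Zowe services would typically use an API ML enabler instead, so treat the annotation and configuration values as assumptions for illustration.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;

// Illustrative service that registers itself with a single discovery instance.
// Starting two copies of this application (on different ports) is enough for
// the Gateway to load-balance and fail over between them.
@SpringBootApplication
@EnableDiscoveryClient
public class SampleServiceApplication {
    public static void main(String[] args) {
        SpringApplication.run(SampleServiceApplication.class, args);
    }
}

// application.yml (hostname and port are placeholders):
//
//   eureka:
//     client:
//       serviceUrl:
//         defaultZone: https://discovery-1:10011/eureka/
```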
Based on yesterday's discussion, here is a bit more on the topic of state and on using message queues (MQs) to handle it:
There are multiple options for handling state that needs to be persistent. One solution is of course a database; another is to publish shared state via one of the message queue solutions.
Using a database to store the state has its issues, mainly operational and security concerns, and achieving HA for the database itself may be operationally difficult.
The limitation of an MQ implementation is that it needs to run on z/OS, which rules out some of the popular open source solutions such as RabbitMQ (written in Erlang); it nevertheless leaves the options of ActiveMQ or IBM MQ.
There are different modes of running the MQ solution. One is to keep the information only in memory, which may be persistent enough because the state is shared between the instances. Another is to store the state on the filesystem, allowing the last known state to be restored even when the server goes down, as long as the filesystem isn't damaged. Depending on the specific implementation there may be other persistence options, such as a database or a cache mechanism (Memcached, Redis).
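Taking ActiveMQ as the example (since it can run on z/OS), the in-memory versus filesystem modes map to the broker's persistence setting. A minimal sketch follows; the broker name, port, and data directory are placeholders.

```java
import org.apache.activemq.broker.BrokerService;

// Illustrative embedded ActiveMQ broker showing the two persistence modes
// discussed above. With persistence off, state lives only in memory and
// survives as long as at least one peer holding a copy stays up; with
// persistence on, the last known state can be restored from the filesystem.
public class StateBroker {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        broker.setBrokerName("state-broker");

        broker.setPersistent(false);                     // in-memory only
        // broker.setPersistent(true);                   // or persist to disk (KahaDB)
        // broker.setDataDirectory("/var/zowe/mq-data"); // placeholder path

        broker.addConnector("tcp://0.0.0.0:61616");      // placeholder port
        broker.start();
    }
}
```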
Embedded in the Gateway
The current Gateway architecture, built upon Spring Cloud, gives us a simple option: run an embedded ActiveMQ broker on each Gateway instance and configure the instances to know about one another, so they share state as if the MQ were deployed as a cluster.
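Building on the broker sketch above, each Gateway instance could run an embedded broker and connect it to the brokers of the other instances through a static network connector, so state published on one instance propagates to its peers. The hostnames and port below are assumptions for illustration.

```java
import org.apache.activemq.broker.BrokerService;

// Illustrative embedded broker for one Gateway instance, networked to the
// embedded brokers of the other instances so that shared state is forwarded
// between them, much like a clustered MQ deployment.
public class GatewayEmbeddedBroker {
    public static void main(String[] args) throws Exception {
        BrokerService broker = new BrokerService();
        broker.setBrokerName("gateway-1-broker");
        broker.setPersistent(false);

        // Accept connections from the local Gateway instance and its peers.
        broker.addConnector("tcp://0.0.0.0:61616");

        // Static list of the other Gateway instances' embedded brokers.
        broker.addNetworkConnector("static:(tcp://gateway-2:61616,tcp://gateway-3:61616)");

        broker.start();
    }
}
```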
High Availability Research:
Understand the problem space for high availability in Zowe - what components need to change in order to start implementing high availability deployments, or what Zowe install / configuration development may be required to support it
Have a shared understanding of High Availability - is it reliability, is it scale, and where does it live: sysplex, multi-sysplex, containers?
Begin researching solutions for the problem space for a given platform once we have a better understanding / goal in mind.