zowe / zowe-install-packaging

Packaging repository for the Zowe install scripts and files
Eclipse Public License 2.0

Epic: High Availability #1467

Closed. MarkAckert closed this issue 4 months ago.

MarkAckert commented 4 years ago

High Availability Research:

1000TurquoisePogs commented 4 years ago

Some notes from today's meeting:

Does it require the Mediation Layer? I think so.

Does it require SSO? I think so.

Does it require sysplex? No, there seem to be two technologies that help:

  1. Network-attached storage (sysplex storage)
  2. A port virtualizer (sysplex distributor, DVIPA)

These features are found in a sysplex, but equivalent software may be found on Linux, so we should not rule out the possibility of HA for people using Docker, while still relying fully on sysplex on z/OS.

Scalability: discuss ways to allow scaling along with HA, but know that they aren't the same thing; HA is probably more important to Zowe's use case (hundreds of users per instance, not millions).

Database as state-holder: does this solve problems or create them?

Can WebSocket connections be HA? It depends. Use socket.io or a reconnecting WebSocket, and store state client-side whenever possible.
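The "store state client-side and reconnect" idea can be sketched as below. This is a minimal illustration, not the socket.io API: `FlakyConnection` is a hypothetical stand-in for a WebSocket transport that sometimes drops, and the point is that the client keeps its unsent messages as local state, so a reconnect loses nothing.

```python
class FlakyConnection:
    """Stand-in for a WebSocket transport that drops the first few sends
    (a hypothetical test double, not a real socket)."""
    def __init__(self, fail_first_n=2):
        self.fails_left = fail_first_n
        self.delivered = []

    def send(self, msg):
        if self.fails_left > 0:
            self.fails_left -= 1
            raise ConnectionError("socket dropped")
        self.delivered.append(msg)


class ReconnectingClient:
    """Keeps state (the outbound queue) client-side so that a
    reconnect can simply replay whatever was not yet delivered."""
    def __init__(self, conn, max_retries=5):
        self.conn = conn
        self.max_retries = max_retries
        self.pending = []  # client-side state: messages not yet delivered

    def send(self, msg):
        self.pending.append(msg)
        self.flush()

    def flush(self):
        for _attempt in range(self.max_retries):
            try:
                while self.pending:
                    self.conn.send(self.pending[0])
                    self.pending.pop(0)  # drop only after a successful send
                return
            except ConnectionError:
                continue  # "reconnect" and replay from client-side state
        raise RuntimeError("gave up after retries")
```

With a connection that drops the first two sends, `client.send("a")` still gets the message through on the third attempt because the queue survives the drops client-side.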

Does it require scripting/instance updates? Yes: there is currently no way to start multiple servers per instance. Ports are an issue, knowing how many servers are started is an issue, and how the logs are formatted is an issue.

Do we have resources to test? Unsure. Can Marist or River handle the extra load of multiple servers? It sounds like they are not configured for sysplex, so that is an issue.

Certificates: how do we handle keystores? Does anything change at all?

Metrics: what numbers do we aim for with regard to …

Logs: same file or separate files, and what formatting within a file?

Upgrades: there is an idea to update the instance and then shut down and restart the servers in the instance one by one, so that there is actually no downtime. That sounds out of scope for this PI, though.

Does 3rd-party software get HA for free when Zowe gets HA? Sometimes. Yes in three cases:

  1. It is a CLI plugin that relies only on CLI core or on other plugins that are already HA-capable
  2. It is an APIML-conformant server and is stateless
  3. It is an App Framework app that does not include an extra server (if it has an extra server, see 2)

Tiers of HA:

Cold backup (tier 1): a parent agent that restarts children when they stop. There is off-the-shelf technology for this already.

Hot backup: this is where a lot of the questions are coming from, but cold backup could be done first to make some people happy.
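The cold-backup idea, a parent agent that restarts stopped children, can be sketched as a small supervisor loop. This is an illustrative sketch, not off-the-shelf tooling: `start_child` and the restart limit are assumptions, and the exponential backoff is a common safeguard against the crash-loop concern raised elsewhere in this thread.

```python
import time


def supervise(start_child, max_restarts=3, base_delay=0.01):
    """Tier-1 'cold backup': restart the child whenever it stops,
    backing off exponentially so a crash loop cannot hammer the system.

    `start_child` is a hypothetical callable that runs the child and
    returns True on clean exit; False (or an exception) means a crash.
    Returns the number of restarts performed before a clean exit.
    """
    restarts = 0
    while True:
        try:
            if start_child():
                return restarts  # clean shutdown: stop supervising
        except Exception:
            pass  # treat an exception like a crash
        restarts += 1
        if restarts > max_restarts:
            raise RuntimeError("crash loop detected; giving up")
        time.sleep(base_delay * (2 ** restarts))  # exponential backoff
```

A child that crashes twice and then runs cleanly would be restarted twice and then left alone; one that never comes up cleanly trips the crash-loop guard instead of being restarted forever.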

John-A-Davies commented 4 years ago

Some thoughts I had around the meeting:

What is HA? What is not HA?

  1. Enhanced availability of a service.
  2. Avoiding interruption or prohibitive slowdown of a service.
  3. HA does not mean continuous availability, which would mean no downtime at all, whether scheduled or unscheduled.

The service is … the Zowe servers and attendant processes on z/OS.
The key to implementing a high availability system is to identify and eliminate single points of failure through redundancies, clustering, and failover mechanisms.

Methods of solution are to automatically:

  1. Reinstate a crashed server by restarting it
  2. Switch a client request stream over to a hot-standby server
  3. Balance the load over several parallel servers, moving it away from servers that have slowed or stopped
  4. Write to regional persistent disks (as Google does), which synchronously replicate data at the block level between two zones in a region
  5. Start up or shut down servers based on load
  6. Recover from disasters

Avoiding false negatives: avoid overloading the O/S by blindly restarting servers that immediately crash again, because the original crash was due to a system resource constraint rather than an exceptional condition.

Order of attack of work items? Based on 3 tiers:

  1. Easy, basic, well-understood, expected solutions, e.g. restarting a crashed server
  2. Harder-to-do items, e.g. maintaining state for a seamless failover without loss of connection
  3. Very hard to do, e.g. geographically separated seamless failover and automatic start/shutdown of servers based on load

DivergentEuropeans commented 4 years ago

Thanks for these notes, @1000TurquoisePogs @John-A-Davies, especially for those of us, like myself, who know very little about what HA means.

jandadav commented 4 years ago

The following article is the expression of our vision for a highly available API Mediation Layer (Apiml) and Zowe.

How did we arrive at this

We had a highly available distribution of Apiml in the past, before it became a Zowe component. When we think about the future Apiml HA story, we refer back to those experiences and designs. We look forward to integrating our vision with the broader Zowe community.

Who is the target persona

The client consuming Zowe services.

What is the main motivation for this vision

To provide clients with reliable service that is resilient to individual service failure. Quoting John Davies:

The key to implementing a high availability system is to identify and eliminate single points of failure through redundancies, clustering, and failover mechanisms.

How does this setup achieve the objective

Zowe HA

Load balancer tier

The client has a single address and port to call: the address of the load balancer. In our case the load balancing was done by DVIPA, but other solutions can be employed; an L4 load balancer such as DVIPA, Linux IPVS, or a hardware solution could be in place. Health checking of Gateway instances is desirable, and the Gateway exposes a REST endpoint for this purpose.
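The health-checking side of this tier can be illustrated with a small sketch. It assumes the load balancer (or a helper beside it) periodically probes each Gateway's health endpoint (the exact endpoint path and probe mechanism are assumptions here) and evicts an instance only after several consecutive failures, so one slow response does not cause a false negative.

```python
class HealthChecker:
    """Keeps Gateway instances in rotation for an L4 load balancer,
    evicting one only after `threshold` consecutive failed probes.

    `record(backend, ok)` would be fed by an HTTP GET of the Gateway's
    health REST endpoint; the probing itself is out of scope here."""

    def __init__(self, backends, threshold=2):
        self.failures = {b: 0 for b in backends}  # consecutive failures
        self.threshold = threshold

    def record(self, backend, ok):
        # A success resets the streak; a failure extends it.
        self.failures[backend] = 0 if ok else self.failures[backend] + 1

    def in_rotation(self):
        # Only instances below the failure threshold receive traffic.
        return [b for b, f in self.failures.items() if f < self.threshold]
```

With a threshold of 2, a single failed probe leaves the instance in rotation; a second consecutive failure removes it, and a later successful probe brings it back.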

Apiml tier

The Gateway

The Gateway Service is the de facto L7 load balancer. It knows about registered services and load-balances requests among them in round-robin mode. There is a basic failover mechanism that retries a request when a service doesn't respond. At this time the Gateway is stateless. These behaviors can be customized or extended, but such extensions should be considered carefully (for instance, adding "sticky sessions" might disable failover and introduce state into the cluster).
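The round-robin-with-failover behavior described above can be sketched as follows. This is a toy model, not the Gateway's actual implementation: the `call` transport and the use of `ConnectionError` to signal an unresponsive instance are assumptions for illustration.

```python
class RoundRobinGateway:
    """Round-robins requests over a service's registered instances,
    skipping to the next instance when one doesn't respond
    (basic retry failover)."""

    def __init__(self, instances):
        self.instances = instances
        self.next_idx = 0  # no per-client state, just a rotation cursor

    def forward(self, request, call):
        # Try each instance at most once before giving up.
        for _ in range(len(self.instances)):
            inst = self.instances[self.next_idx]
            self.next_idx = (self.next_idx + 1) % len(self.instances)
            try:
                return call(inst, request)
            except ConnectionError:
                continue  # failover: retry against the next instance
        raise RuntimeError("no instance could serve the request")
```

If instance "a" is down, a request simply fails over to "b", and the rotation continues from there, which is why a stateless gateway can retry safely while sticky sessions would complicate this.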

The Discovery

Discovery Service instances are clustered together and distribute information about registered services among themselves. Discovery Service instances need to know about each other; this is the place where the cluster is connected. It implies that either the cluster is defined at install time (static: z/OS install) or we come up with a way to distribute the configuration (dynamic: container).

Service tier

Individual services need to know about, and register with, a single Discovery instance. High availability is achieved automatically when more than one instance of a service is registered.
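The interplay of the two previous points, peer-linked discovery instances plus single-instance registration, can be shown with a toy model. The class and method names below are hypothetical (this is not the Eureka API); the point is that a registration made against one instance is replicated to its peers, and registering a second instance of a service is all it takes to make that service HA-capable behind the Gateway.

```python
class DiscoveryService:
    """Toy model of clustered discovery: each instance accepts
    registrations and replicates them to its peers, so a service
    only needs to register with a single instance."""

    def __init__(self):
        self.registry = {}  # service name -> set of instance addresses
        self.peers = []

    def link(self, *peers):
        # The cluster topology: who replicates to whom (defined at
        # install time in the static case).
        self.peers = list(peers)

    def register(self, service, address, _replicated=False):
        self.registry.setdefault(service, set()).add(address)
        if not _replicated:  # fan out once; peers do not re-replicate
            for p in self.peers:
                p.register(service, address, _replicated=True)

    def lookup(self, service):
        return sorted(self.registry.get(service, set()))
```

Registering two addresses for the same service against one discovery instance makes both visible from every peer, which is exactly what the Gateway needs for round-robin and failover.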

Our point of view on some of the topics being debated

Unknowns and Questions

balhar-jakub commented 4 years ago

Based on the discussion yesterday, here is a bit more on the topic of state and using MQs to handle it:

State handling

There are multiple options for handling state that needs to be persistent. One solution is, of course, a database. Another is to publish shared state via one of the message queue (MQ) solutions.

Using a database to store the state has issues, mainly operational and security concerns, and achieving HA for the database itself may be operationally difficult.

The limitation for an MQ implementation is that it needs to run on z/OS, which rules out some popular open-source solutions such as RabbitMQ (written in Erlang); nevertheless, it doesn't remove the option of using ActiveMQ or IBM MQ.

There are different modes of running an MQ solution. One is to keep the information only in memory, which may turn out to be persistent enough, since it is shared between the instances. Another is to store the state on the filesystem, allowing restoration of the last known state even when the server goes down, as long as the filesystem isn't damaged. Depending on the specific implementation, there may be other persistence options, such as using a database or a cache mechanism (Memcached, Redis) to persist the data.
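The difference between the memory-only and filesystem-backed modes can be illustrated with a minimal sketch. This is not any particular MQ's persistence layer; the JSON write-through file is an assumption chosen for illustration.

```python
import json
import os
import tempfile


class StateStore:
    """Sketch of the two persistence modes discussed: memory-only
    state (lost when the last holder dies) versus filesystem-backed
    state that survives a restart as long as the file is intact."""

    def __init__(self, path=None):
        self.path = path  # None -> memory-only mode
        self.data = {}
        if path and os.path.exists(path):
            with open(path) as f:
                self.data = json.load(f)  # restore last known state

    def put(self, key, value):
        self.data[key] = value
        if self.path:
            with open(self.path, "w") as f:
                json.dump(self.data, f)  # write-through to the filesystem
```

A second `StateStore` opened on the same path (simulating a restarted instance) sees the last known state, while a memory-only store starts empty every time, which is the trade-off the paragraph above describes.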

Embedded in the Gateway

The current architecture of the Gateway, built upon Spring Cloud, gives us a simple option: set up an embedded ActiveMQ broker on each Gateway instance and configure the instances to know about the others, sharing state as if the MQ were deployed as a cluster.