Zato 3 scheduler component seems to not be high availability aware (at least by default)

zatosource / zato

ESB, SOA, REST, APIs and Cloud Integrations in Python

https://zato.io

GNU Affero General Public License v3.0

1.12k stars 240 forks source link

Zato 3 scheduler component seems to not be high availability aware (at least by default) #873

Closed rtrind closed 3 years ago

rtrind commented 6 years ago

My environment has 3 RHEL6.7 machines, each with;

1 server;
1 scheduler;

After deploying a service, creating a channel for it and testing it in single executions, which were working properly, I created on the scheduler tab a call to fire this service once per minute. The problem is all 3 scheduler components are calling the service, which makes each server of each machine to fire the service, which is not the desired effect. I want a single call of the service, like it is on Zato 2.

I'm not sure if the default configuration created by "zato create scheduler" is missing something to make them aware of each other or if I'm not supposed to use them like this (the only change I did to the scheduler.conf was to use the Redis Sentinel in configuration, the cluster section is the same in all 3 schedulers).

My end goal is to achieve high availability by having multiple schedulers active and in case one of them goes south, the environment keeps working, just like Zato 2.

If you need more information to reproduce the scenario let me know.

rtrind commented 6 years ago

EDIT! After restarting the cluster, the service continues to fire 3 times per minute, but the executions are falling on the same server all the time: 20:45 -> machine 1 + 1 + 3; 20:46 -> machine 1 + 1 + 3; 20:47 -> machine 1 + 1 + 2; 20:48 -> machine 1 + 1 + 2; 20:49 -> machine 1 + 1 + 2;

It seems the scheduler behavior is the same, only the server selection on who will run each execution, which is not exactly important.

dsuch commented 6 years ago

The situation in 2.0 was that the failover mechanism could try at times promote a different scheduler thread to the active status even if the primary one was still running. This was a source of confusion and each time required a full restart of all servers.

In 3.0, there can be only one scheduler component for the whole cluster. There is no built-in HA at this point. You should start a single one only. This will surely change but right now it is not available.

rtrind commented 6 years ago

Hmmm, this is unfortunate. Today I have Zato 2 with ODB on a single machine (which is a point of failure for high availability). I was evaluating the stability of the platform since there were some fixes on 2.0.8 which we never were able to make it work and because Zato3 was just around the corner, so I waited for it.

Because of the infamous issues related to the FTP/SFTP blocking access (even using the default zato fs library + ssh2_python on non-blocking mode) sometimes Zato2 gets to a point where the scheduler does not fire anything anymore, which today is circumvented by a simple script at cron which checks if the servers are working properly based on Zato logs and if something goes wrong and does not reestablish itself in 20 minutes, it restarts all zato instances, which makes it work properly.

Without a scheduler being HA aware, I cannot start the scheduler automatically on init.d scripts and will have to control this externally, which is definitely a downgrade at this point in time (the more I have to control Zato architecture by myself, the more chance of errors I have).

Do you have visibility if this is something you are already planning to update on Zato without an sponsor?

dsuch commented 6 years ago

I understand what you are saying and I realize that this is not an ideal and final approach but I believe this is still much better than in 2.0.

In 2.0, the scheduler could sporadically fire tasks twice. This was not deterministic and the only way to get around it was to restart all servers.

In 3.0, it is guaranteed that the scheduler will never execute tasks twice as long as you don't start the scheduler twice. Moreover, if the scheduler for any reason is not available, you can restart this one component only without doing anything with servers.

Also, it exposes a TCP endpoint that you can check - if it replies, it means that the scheduler is up and running. You said that you already have some scripts around it so you could use them to start another instance only if the first is not replying (or you could restart the first one).

I know that this is a different approach but I think it is more reliable because it is deterministic and one has everything under direct control. Most importantly, only the scheduler has to be restarted - in your case, it sounds like you are comfortable with restarting all servers but other people were finding it too much, it required for all their APIs and endpoints to stop in order to restart just one component.

The reason why it does not have true HA yet is that it appears that the only way to guarantee it in a general way is to implement something like the Raft protocol in Zato and make the scheduler use it. This is not a trivial job and there has been no time for it yet. It also has implications like requiring to have an odd number of schedulers in a cluster - many people are happy to have only two components of any kind.

Mid and long-term plans for development until the release in the next year are various enhancements to pub/sub, SFTP, connection plugins and Python 3 support. I would like to implement HA for schedulers but this requires a sponsor.

rtrind commented 6 years ago

Understood. I agree the new architecture with an isolated scheduler seems better, since it could allow me to restore some situations with the restart of only the component. But now I need external (non-zato) tools to make sure a scheduler (and only one scheduler) is up and running without manual intervention, which is critical for my use cases.

I will see how complex it is to make the scheduler HA by using of external tools, even though it reduces how elegant the system is right now.

For me, Zato is a perfect tool by being robust, lightweight at the same time, lightning fast, flexible and HA from the box. The only major thing not elegant enough being the necessity of making the ODB HA which is a hell of a setup to make it working (at least with the more common options, sqlite / mysql / postgresql) but I understand this is not a Zato problem. So my only lasting issue was how unreliable FTP/SFTP is right now in keeping Zato2 working without problems, mitigated by the external restart scripts (which affect my uptime for some endpoints, so this impacts me as well).

By having to control the HA of the scheduler externally it creates the burden of HA on me, which is what I was disappointed to find out.

I ask you to please update the scheduler docs (as of #860) to make it clear for other users of this. I will let you know if I find any simple solutions so you can also mention it at the documentation, not sure how much time I can devote in the next days to work on this.

Thanks for the information anyway.

dsuch commented 6 years ago

I am adding scheduler docs as we speak and I will update the relevant tickets soon.

But there is one matter I do not understand - you said you were already employing scripting to check if there are no tasks executed twice so, in you case, is it all not a matter of re-purposing them to restart the scheduler if need be instead of restarting all of the servers?

As for HA in general - I consider it a beginning of a longer way, perhaps spread across a few releases, to introduce a Raft-like algorithm for deciding when a given component is down or not. Schedulers looks like the great first application of it. There are also a few parts in pub/sub that could benefit from it.

dsuch commented 6 years ago

As for external tools - here is a sample configuration file for supervisord. You can start your scheduler under supervisord and it will automatically monitor if it is up or not and restart it accordingly:

https://github.com/zatosource/zato-build/blob/master/docker/quickstart/supervisord.conf

rtrind commented 6 years ago

But there is one matter I do not understand - you said you were already employing scripting to check if there are no tasks executed twice so, in you case, is it all not a matter of re-purposing them to restart the scheduler if need be instead of restarting all of the servers?

My script does not check for duplicate firing (I don't even have this issue). It looks at the singleton.log to see if everything is working normally. Usually when the scheduler goes dead (caused by zato fs 0.4.0 FTP blocking calls) the singleton log also reflects this (by not showing the keepalive lines periodically). Based on this, I restart the servers on all 3 machines, which is undesired, because I lose several activations of my jobs during this period, which generates a backlog on my processing queue (which is time sensitive).

In Zato 3 it's a different thing. It's not a matter of identifying if the scheduler is up or not and restarting, but an architectural challenge in selecting in which machine should I make the scheduler up or not. I will have to:

Remove all 3 schedulers from /etc/zato/components-enabled (so they don't start automatically by the system);
Start one of the 3 schedulers automatically (and only one);
If the machine with the scheduler goes down, start the scheduler on another machine (extra challenge by quorum voting between 3 nodes);
If the original machine comes back, do not start the scheduler on it anymore (since another one is already active);
(optionally) If the scheduler crashes, restart it on another machine (which could be the same one where it was before).

This has nothing to do with the current mechanism I have so I cannot reuse it.

In other words, the problem is not being able to restart a scheduler in a single machine while it's alive and available. It's the component being able to survive a machine dump, app crash, shutdown or network isolation and still have one scheduler always up and available (without manual intervention). Zato 2 today attends my criteria for this (with the hard restart script which runs usually once or twice a day).

Hope it's clearer now.

aek commented 6 years ago

@rtrind maybe you should use another scheduler for the distributed HA job execution since as @dsuch states that Zato doesn't support right now a consensus algorithm.

This has been discussed before here: https://forum.zato.io/t/migrated-network-addresses/571/10 https://forum.zato.io/t/scheduler-fires-same-service-twice/1215/7

rtrind commented 6 years ago

Small update...

I noticed we already use keepalived on the machines zato resides on our environment, so I replicated the configuration and with small adjustments I was able to setup it to start the scheduler component in a single machine at a time.

This is far from ideal, since I could not find any way for it to health check the scheduler being up and starting it again (or in another node). But at least if one machine becomes network isolated or goes down completely, the service should be migrated properly to another one automatically, which is light years ahead of nothing. This effectively unblocks me from upgrading right now, so I will continue testing Zato 3 in my lab environment.

I suggest to keep the issue open for future prioritization of this development, whenever it gets discussed again.

dsuch commented 6 years ago

Hello @rtrind - keepalived has option TCP_CHECK - can you use it to establish if the load-balancer is still running and then act accordingly?

In scheduler.conf there is this stanza ..

[bind]
host=0.0.0.0
port=31530

.. so you would be able to connect to it on port 31350 by default.

rtrind commented 6 years ago

I already had tried this, without much success... (maybe I'm doing this wrong)

Attempt 1 Server A weight - 90 Server B weight - 80 Server C weight - 70 notifymaster starts scheduler notify[everything else] stops scheduler On health check down, -30 to weight. Results: Only the master starts the scheduler. B and C continue to check, but go down in priority. If i shutdown A scheduler, it continues to have higher priority, never gives up master, scheduler is now down on the cluster.

Attempt 2 Same config as above except: No weight on HC down, which moves machine to state fault. Results: All machines start on fault event, so no scheduler is ever started. _initfail parameter does not change anything.

I tried more combinations without much success. Unless I'm missing some magical configuration, I would need to use attempt 1 strategy with a more robust HC, which is smart enough to know if the current machine is the master and only returns true fail there (always success on the secondary machines). Then using preempt it won't try to immediately move the service back to the first machine, which is probably a good idea.

If my keepalived used a floating IP this would be easier, but we are really tight in free ip addresses in this subnetwork, I cannot freely use one for this in my production. If I figure out another way to check this, maybe I can achieve an almost perfect solution (except for needing an external tool to do it, but understandable considering the other priorities right now), but it's a holiday tomorrow, will try more options next week.

Thanks anyway!

rtrind commented 6 years ago

I managed to make something along those lines working. Basically:

Same configuration from attempt 2;
Added "fail 6" "rise 6" "interval 10" and "nopreempt" on the HC, which consists of;
- If the machine is master, check if the port 31530 is listening. Fail if it's not (exit 1);
- If the machine is not master, always return success (exit 0);
Notify scripts to feed the file used by the HC and start/stop the scheduler in proper transition states.

In this case, all machines start with fail, but since no machine is master, the HC will return them to BACKUP state in a minute. When this happens, one of the machines takes over as master and starts the scheduler. After one or 2 fails on the HC (not enough to trigger the fail state, until the port is listening as expected) the HC returns OK and keeps it this way.

This covers some scenarios:

Machine going down;
Scheduler crashing;

This kinda covers:

Machine network isolation;

There is still the split brain scenario, where for some reason the machine are isolated from each other (from a network perspective) and they all could think they should be master, but since in this case the ODB won't be available as R/W on all 3 machines at the same time, I believe the activations on such machines will always fail, which is acceptable.

So, in the end, it's not super elegant, it's dependent on configuration by the deployer but at least it's something which allows me to continue evaluating Zato 3 for my environment.