youdowell / k8s-galera-init

Example of MariaDB Galera init container for Kubernetes StatefulSet

cluster refuses to start after crash #3

Open seecsea opened 7 years ago

seecsea commented 7 years ago

Hi, I removed some exited containers on a k8s node to clean up docker ps, so the galera pod status changed to Init:0/1, not Running. Then I ran delete -f mysql.yaml, all pods were cleaned up, and after create -f mysql.yaml the pod cannot start:

NAME      READY   STATUS             RESTARTS   AGE   IP            NODE
mysql-0   0/1     CrashLoopBackOff   5          4m    172.30.6.17

I did not delete the PVCs and PVs (provisioned via a Ceph RBD StorageClass), nor the ConfigMap, Secret, etc.

the logs:

2017-06-16 14:19:08 140438696024000 [Note] mysqld (mysqld 10.1.24-MariaDB-1~jessie) starting as process 1 ...
2017-06-16 14:19:08 140438696024000 [Note] WSREP: Read nil XID from storage engines, skipping position init
2017-06-16 14:19:08 140438696024000 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
2017-06-16 14:19:08 140438696024000 [Note] WSREP: wsrep_load(): Galera 25.3.20(r3703) by Codership Oy info@codership.com loaded successfully.
2017-06-16 14:19:08 140438696024000 [Note] WSREP: CRC-32C: using "slicing-by-8" algorithm.
2017-06-16 14:19:08 140438696024000 [Note] WSREP: Found saved state: 4bb73083-4b4a-11e7-a4c7-fbb547b972fa:-1, safe_to_bootsrap: 0
2017-06-16 14:19:08 140438696024000 [Note] WSREP: Passing config to GCS: base_dir = /var/lib/mysql/; base_host = mysql-0.mysql.seecsea.svc.cluster.local; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.recover = no; gcache.size = 128M; gcomm.thread_prio = ; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S
2017-06-16 14:19:08 140438696024000 [Note] WSREP: GCache history reset: old(4bb73083-4b4a-11e7-a4c7-fbb547b972fa:0) -> new(4bb73083-4b4a-11e7-a4c7-fbb547b972fa:-1)
2017-06-16 14:19:08 140438696024000 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
2017-06-16 14:19:08 140438696024000 [Note] WSREP: wsrep_sst_grab()
2017-06-16 14:19:08 140438696024000 [Note] WSREP: Start replication
2017-06-16 14:19:08 140438696024000 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2017-06-16 14:19:08 140438696024000 [ERROR] WSREP: It may not be safe to bootstrap the cluster from this node. It was not the last one to leave the cluster and may not contain all the updates. To force cluster bootstrap with this node, edit the grastate.dat file manually and set safe_to_bootstrap to 1 .
2017-06-16 14:19:08 140438696024000 [ERROR] WSREP: wsrep::connect(gcomm://) failed: 7
2017-06-16 14:19:08 140438696024000 [ERROR] Aborting

But the init container has the env var set, e.g. { "name": "SAFE_TO_BOOTSTRAP", "value": "1" }. Does it not work correctly?

seecsea commented 7 years ago

Oh, I am sorry, I fixed it by adding the line "subPath": "mysql" to the init container, just like you said in https://github.com/youdowell/k8s-galera-init/issues/2#issuecomment-307033861
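
For context: without the matching subPath, the init container operates on a different directory than the MariaDB datadir, so rewriting safe_to_bootstrap has nothing to act on. One quick way to confirm whether the change actually reached the datadir is to read grastate.dat from the running pod (pod and namespace names as they appear elsewhere in this thread):

kubectl exec mysql-0 -n seecsea -- cat /var/lib/mysql/grastate.dat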

ausov commented 7 years ago

Ok, good that you found a solution, thank you for sharing!

Please note: after a total cluster crash where all nodes go down, there's a chance that the most recent updates are lost during restart with the SAFE_TO_BOOTSTRAP option turned on!

This may happen:

  1. nodes 1,2,3 running
  2. last updated data (D) on node 3
  3. all nodes crash suddenly so that node 3 did not have time to sync data to node 1
  4. nodes restarted: 1,2,3
    • node 1 starts
    • node 1 does not have the last update, so it would normally refuse to start
    • since the SAFE_TO_BOOTSTRAP option is turned on, it starts anyway
    • node 2 starts and takes last updates from node 1
    • node 3 starts and takes last updates from node 1
  5. last updated data (D) on node 3 is DISCARDED!

Read more: https://www.percona.com/blog/2014/09/01/galera-replication-how-to-recover-a-pxc-cluster (See section "Scenario 6" - All nodes went down without proper shutdown procedure)

The standard advice for reliable recovery is not to use SAFE_TO_BOOTSTRAP but to manually analyse which node has the latest data and start that node first. This may not be so easy with a K8s StatefulSet because it always starts from the first node. A solution might be to start all nodes in parallel and launch some network sync script before starting the mysql service. The script would identify the latest node and start it first, roughly as sketched below.
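
A rough sketch of such a helper (hypothetical, not part of this repo): it compares the seqno recorded in grastate.dat on each replica and reports the most advanced one. It assumes a StatefulSet named mysql, the namespace used in this thread (seecsea), data at /var/lib/mysql, and that kubectl exec can reach the pods while mysqld is down.

#!/bin/bash
# Find which Galera pod holds the most advanced state by comparing the
# seqno stored in grastate.dat on every replica.
# Note: after a hard crash seqno is often -1; in that case the real position
# must first be recovered with mysqld --wsrep-recover on each node.
NAMESPACE=${NAMESPACE:-seecsea}
REPLICAS=${REPLICAS:-3}

best_pod=""
best_seqno=-2

for i in $(seq 0 $((REPLICAS - 1))); do
  pod="mysql-$i"
  seqno=$(kubectl exec -n "$NAMESPACE" "$pod" -- \
    awk '/^seqno:/ {print $2}' /var/lib/mysql/grastate.dat 2>/dev/null)
  echo "$pod seqno=${seqno:-unknown}"
  if [ -n "$seqno" ] && [ "$seqno" -gt "$best_seqno" ]; then
    best_seqno="$seqno"
    best_pod="$pod"
  fi
done

echo "Bootstrap candidate: $best_pod (seqno $best_seqno)"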

seecsea commented 7 years ago

The problem still exists. Today all my servers lost power. After all servers came back up, the MySQL Galera cluster goes into Error or CrashLoopBackOff status; the logs are the same as above. Then I ran delete -f mysql.yaml and create -f mysql.yaml, and all data came back OK (backed by a StorageClass with Ceph RBD); the cluster was all OK. Then I rebooted all node servers (three nodes in the test lab), and the MySQL Galera cluster went into Error or CrashLoopBackOff status again. It is the same result every time. What can I do to fix it?

ausov commented 7 years ago

AFAIK, the safest way to migrate your cluster is not to delete everything at once but to scale down with some delay to 1 node. This way the first node will have all changes propagated from the other nodes. Then restart the only remaining node and scale up again.

Losing all nodes at once should be a very rare disaster in a cluster environment and will normally need some manual restoration, as explained in the link above.

Also, normally you do not want to delete -f mysql.yaml. You should not delete the StatefulSet and its resources but rather use other commands:

You can try this to scale down (note: if your cluster is in bad shape, you may lose recent changes on some nodes):

kubectl scale statefulset mysql --replicas=1

Check that all nodes shut down:

kubectl get po -l app=mysql

Then wait until all nodes except the 1st are properly shut down and the first node runs without errors. Then scale up:

kubectl scale statefulset mysql --replicas=3
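
For example, to watch the pods during the scale down/up and check that the first node comes up cleanly (namespace as used in this thread; adjust to your setup):

# watch pod status while scaling
kubectl get po -l app=mysql -n seecsea -w

# inspect the first node's log for startup errors
kubectl logs mysql-0 -n seecsea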

What does the log say? What's the source of the error causing the "CrashLoopBackOff"?

seecsea commented 7 years ago

Neither kubectl patch statefulset mysql -p '{"spec":{"replicas":1}}' -n seecsea nor kubectl scale statefulset mysql --replicas=1 -n seecsea has any effect on the errored cluster; the 3 replicas of the StatefulSet are still there. I cannot scale down the errored cluster.

I exec into mysql-0 and change the state file with sed while the pod mysql-0 is in 0/1 Running: sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/g' /var/lib/mysql/grastate.dat. The error logs changed:

2017-07-25 17:28:00 139894018029504 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out) at gcomm/src/pc.cpp:connect():158
2017-07-25 17:28:00 139894018029504 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-07-25 17:28:00 139894018029504 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1404: Failed to open channel 'mysql' at 'gcomm://mysql-1.mysql.seecsea.svc.cluster.local,mysql-2.mysql.seecsea.svc.cluster.local': -110 (Connection timed out)
2017-07-25 17:28:00 139894018029504 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-07-25 17:28:00 139894018029504 [ERROR] WSREP: wsrep::connect(gcomm://mysql-1.mysql.seecsea.svc.cluster.local,mysql-2.mysql.seecsea.svc.cluster.local) failed: 7
2017-07-25 17:28:00 139894018029504 [ERROR] Aborting

How about unconditionally setting safe_to_bootstrap: 1 in the state file (no if checks)? That is, modify the init container's start script like this:

#!/bin/bash

[ "$DEBUG" = "1" ] && set -x

GALERA_CONFIG=${GALERA_CONFIG:-"/etc/mysql/conf.d/galera.cnf"}
DATA_DIR=${DATA_DIR:-"/var/lib/mysql"}
HOSTNAME=$(hostname)

# Pod hostname in k8s StatefulSet is formatted as: "statefulset_name-index"
CLUSTER_NAME=${CLUSTER_NAME:-${HOSTNAME%%-*}}

sed -i 's/safe_to_bootstrap: 0/safe_to_bootstrap: 1/g' "$DATA_DIR/grastate.dat"
...
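
Given the data-loss warning earlier in this thread, a more cautious variant (a hypothetical sketch, not the repo's actual script) would force the bootstrap flag only on the first pod, and only when SAFE_TO_BOOTSTRAP=1 is explicitly set:

#!/bin/bash
# Hypothetical safer variant: rewrite grastate.dat only on pod ordinal 0
# and only when explicitly requested, instead of on every node unconditionally.
DATA_DIR=${DATA_DIR:-"/var/lib/mysql"}
HOSTNAME=$(hostname)
ORDINAL=${HOSTNAME##*-}    # "mysql-0" -> "0"

if [ "$SAFE_TO_BOOTSTRAP" = "1" ] && [ "$ORDINAL" = "0" ] && [ -f "$DATA_DIR/grastate.dat" ]; then
  sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' "$DATA_DIR/grastate.dat"
fi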

ausov commented 7 years ago

You can just run kubectl delete po -l app=mysql. This will stop the nodes, scale down to 1, and then start them again from the first one.

If you also set the env variable SAFE_TO_BOOTSTRAP to 1, you accept possible data loss, and the cluster will then be forced to start.

The second way is to manually set safe_to_bootstrap: 1 in grastate.dat on the node that has the latest dataset and run that node first (run mysqld --wsrep-new-cluster). Wait until the other nodes have started and joined, then restart that node, as sketched below.
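
For illustration, a sketch of that manual path on Kubernetes (hypothetical commands; assumes mysql-2 is the node with the latest data, the namespace from this thread, and that kubectl exec can reach the container):

# mark the most advanced node as safe to bootstrap
kubectl exec mysql-2 -n seecsea -- sed -i 's/^safe_to_bootstrap: 0/safe_to_bootstrap: 1/' /var/lib/mysql/grastate.dat

# bootstrap a new cluster from that node (this blocks while mysqld runs; use a separate terminal)
kubectl exec mysql-2 -n seecsea -- mysqld --wsrep-new-cluster --user=mysql

# once the other nodes have started and joined, delete the pod so it
# restarts through the normal entrypoint and rejoins as a regular member
kubectl delete po mysql-2 -n seecsea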


Generally, it seems that full recovery of a Galera cluster deployed as a simple K8s StatefulSet (without some sort of arbitrator service) is not an easy exercise. Maybe you should think about switching to a master-slave deployment, especially if your environment may experience a full crash. This could also be a K8s StatefulSet where mysql-0 would always be the master.

seecsea commented 7 years ago

Sorry for my late reply. This bug is due to https://github.com/kubernetes/kubernetes/issues/36485: the init container is not run again after the node reboots (so it does not modify safe_to_bootstrap to 1). It may not be fixed until release 1.8.