jmesnil opened this issue 4 years ago
@ochaloup fyi, this can be reproduced and I do not see the same issue with WildFly 17 S2I.
@jmesnil I'm trying to reproduce what you observe. I didn't follow your exact setup, as I use CodeReady and the master branch of the operator, but still I was not able to reproduce the issue.
From the log I can see that the failure happens on the socket dial to podIP:4712 to execute the recovery scan. The call fails after 30s, which is the timeout for dialing that IP address:port. It's just strange that you don't hit the same issue with WFLY17.
I will continue the investigation tomorrow, when I'll try to run the exact branch on minikube.
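For illustration only, here is a minimal Go sketch of the kind of direct dial the operator attempts here; port 4712 comes from the log above, while the pod IP, variable names, and the 30s timeout are assumptions:

```go
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// Hypothetical pod IP inside the minikube VM, not routable from localhost.
	podIP := "172.17.0.5"
	addr := net.JoinHostPort(podIP, "4712") // 4712: the recovery listener port from the log

	// This direct dial is what times out after 30s when the operator
	// runs on localhost and the pod runs inside the minikube VM.
	conn, err := net.DialTimeout("tcp", addr, 30*time.Second)
	if err != nil {
		fmt.Printf("recovery scan dial failed: %v\n", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to recovery listener")
}
```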
@jmesnil after some struggle I reproduced the issue. The trouble is that the operator runs on a different network than the pods: the operator needs to connect directly to the pod and open the socket, which is not possible because minikube runs in a virtual machine while the operator runs on localhost. If the network is set up so that localhost can connect to the IPs (or DNS names) of the virtual machine, the socket connection works.
When the operator and the pods run in the same space (Kubernetes defines the network as flat, as far as I know), the dial to the socket works.
This issue should be closed.
Unfortunately, it's currently not possible to run the operator locally and process the scale down. What will help is when https://issues.jboss.org/browse/JBEAP-17611 is done. That will mean there are only CLI calls, which go over the Kubernetes API server and are therefore (I assume) accessible over the network from localhost to the virtualized environment.
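As a hedged sketch of what that could look like (not the operator's actual code): with client-go, a CLI call can be executed through the pods/exec subresource, so all traffic is tunneled through the API server, which stays reachable from localhost. The namespace, pod and container names, paths, and the CLI operation below are assumptions:

```go
package recoveryscan

import (
	"bytes"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// runRecoveryScan runs a jboss-cli.sh command inside the pod via the
// Kubernetes API server instead of dialing the pod's IP directly.
func runRecoveryScan(config *rest.Config, clientset *kubernetes.Clientset) error {
	req := clientset.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace("default").  // assumed namespace
		Name("quickstart-1").  // assumed pod name
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: "wildfly", // assumed container name
			Command: []string{
				"/opt/jboss/wildfly/bin/jboss-cli.sh", "-c",
				// illustrative management operation touching the transaction log store
				"/subsystem=transactions/log-store=log-store:probe()",
			},
			Stdout: true,
			Stderr: true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return err
	}

	var stdout, stderr bytes.Buffer
	// The exec stream goes over the API server connection, so it works
	// even when pod IPs are not routable from the operator's host.
	if err := exec.Stream(remotecommand.StreamOptions{Stdout: &stdout, Stderr: &stderr}); err != nil {
		return fmt.Errorf("exec failed: %v (stderr: %s)", err, stderr.String())
	}
	fmt.Println(stdout.String())
	return nil
}
```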
Why is this issue not happening with WildFly 17 S2I?
@jmesnil because the WFLY17 S2I does not define the recovery-listener. When the listener is not defined, no recovery is launched: the transactions are left unfinished and the scale down proceeds. See https://github.com/wildfly/wildfly-operator/pull/75#issuecomment-534076401
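For context, the recovery listener is controlled by an attribute on the transactions subsystem; a WildFly CLI sketch of enabling it (standalone mode assumed, and a reload is needed afterwards):

```
/subsystem=transactions:write-attribute(name=recovery-listener, value=true)
reload
```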
@ochaloup This is another good reason to provide a proper management operation for the recovery scan...
I do agree and I plan to work on the issue JBEAP-17611 soon ;-)
OK, so that means I'll comment out the scale down test for WildFly 18 S2I until it is possible for the operator to issue a recovery scan in WildFly using a management operation (targeting WildFly 19 then).
@jmesnil I don't think that's a good idea. The e2e test should still work. What does not work is running the operator locally on localhost while the rest runs in minikube. If the operator and the pods are on the same network (as in a usual OpenShift/Kubernetes deployment), then everything works fine.
I would really be happy if we could keep the scale down test enabled.
Steps to reproduce:
1. `make run-local-operator`
2. Change `replicas` to `1`.

The Operator will start recovery, but an error appears and the pod `quickstart-1` is not terminated: