Open tarnfeld opened 10 years ago
Implemented at a835b126a.
Just giving this a go now on a staging cluster, actually. I'll close it if it seems to work fine.
In general it seems to work OK (the ZooKeeper group aspect), but I think my testing around ZK also falls down with #15. In the event of an identical appointment, presumably the code that keeps reconnecting to the known master should kick in? I think that bit is broken.
2015-03-28 03:21:01,841[pesos.detector] FutureMasterDetector.detect no-op because previous same as leader: None
2015-03-28 03:21:01,843[pesos.detector] FutureMasterDetector.appoint accepting appointment master@192.168.33.2:5050
2015-03-28 03:21:01,843[pesos.scheduler] New master detected: master@192.168.33.2:5050
2015-03-28 03:21:01,843[pesos.scheduler] Registering framework: framework {
user: "tom"
name: "xxx"
hostname: "1.0.0.127.in-addr.arpa"
}
2015-03-28 03:21:01,844[pesos.scheduler] Setting transition watch from previous master: master@192.168.33.2:5050
2015-03-28 03:21:01,844[pesos.detector] FutureMasterDetector.detect no-op because previous same as leader: master@192.168.33.2:5050
2015-03-28 03:21:01,919[x.scheduler] Framework 20150328-031924-35760320-5050-1308-0000 registered to http://vagrant-ubuntu-trusty-64:5050
2015-03-28 03:21:01,961[x.scheduler] Handling 1 offers
2015-03-28 03:21:03,844[pesos.scheduler] Skipping registration because we are either connected or there is no appointed master.
2015-03-28 03:21:07,354[x.scheduler] Handling 1 offers
2015-03-28 03:21:13,358[x.scheduler] Handling 1 offers
2015-03-28 03:21:19,362[x.scheduler] Handling 1 offers
2015-03-28 03:21:23,611[compactor.context] Received disconnection from master@192.168.33.2:5050 but no stream found.
2015-03-28 03:21:30,659[pesos.detector] FutureMasterDetector.appoint skipping identical appointment master@192.168.33.2:5050
2015-03-28 03:21:34,061[pesos.detector] FutureMasterDetector.appoint skipping identical appointment master@192.168.33.2:5050
Thanks for the report. I'll take a closer look.
Simply removing the check from here seems to do the trick, but I don't think that's the real solution.
Edit: I also added the following method to the scheduler:
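    def exited(self, pid):
        # Only react when the process that exited is the master we are tracking.
        if pid == self.master:
            log.info('Disconnected from current master: %s' % pid)
            # Send ourselves a 'detect' message after the retry interval.
            self.context.delay(self.MASTER_DETECTION_RETRY_SECONDS, self.pid, 'detect')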
Just getting to grips with things, but I assume it's just a case of implementing one of those in pesos.detector?
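A rough sketch of that idea, as a toy model rather than the real code: `ToyDetector`, its `listeners` list, and its `exited` method are made-up names for illustration and are not the actual pesos.detector classes. It assumes the detector can be told when the connection to the leader drops; forgetting the leader at that point lets an identical appointment through again, so the scheduler gets another chance to register.

```python
class ToyDetector:
    """Hypothetical stand-in for a master detector; not the pesos API."""

    def __init__(self):
        self.leader = None    # last appointed master, or None
        self.listeners = []   # callbacks invoked on each accepted appointment

    def appoint(self, leader):
        # Same idea as the "skipping identical appointment" log line above.
        if leader == self.leader:
            return False
        self.leader = leader
        for callback in self.listeners:
            callback(leader)
        return True

    def exited(self, pid):
        # Forget the leader when its connection drops, so the same master
        # can be appointed again and the listeners are notified.
        if pid == self.leader:
            self.leader = None


detector = ToyDetector()
registrations = []
detector.listeners.append(registrations.append)

master = "master@192.168.33.2:5050"
first = detector.appoint(master)    # accepted: scheduler registers
repeat = detector.appoint(master)   # skipped: identical appointment
detector.exited(master)             # connection to the master is lost
again = detector.appoint(master)    # accepted again: scheduler re-registers
```

With something along these lines the identical-appointment check could stay in place, since it would only suppress repeats while the connection is still believed to be up.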