Closed thekid closed 1 year ago
Initial tests show everything works as expected when switching connections. Added this debugging code:
diff --git a/src/main/php/com/mongodb/io/Connection.class.php b/src/main/php/com/mongodb/io/Connection.class.php
index e6fbaa3..8d2563c 100755
--- a/src/main/php/com/mongodb/io/Connection.class.php
+++ b/src/main/php/com/mongodb/io/Connection.class.php
@@ -203,6 +203,14 @@ class Connection {
$r= $this->send(self::OP_MSG, "\x00\x00\x00\x00\x00", $sections);
if (1 === (int)$r['body']['ok']) return $r;
+\util\cmd\Console::writeLine(
+ '[MONGODB] Error ',
+ $this->address() === $this->server['primary'] ? 'primary::' : 'secondary::',
+ $this->address(),
+ ' -> ',
+ $r
+);
+
throw Error::newInstance($r['body']);
}
diff --git a/src/main/php/com/mongodb/io/Protocol.class.php b/src/main/php/com/mongodb/io/Protocol.class.php
index 9f66569..5fe671e 100755
--- a/src/main/php/com/mongodb/io/Protocol.class.php
+++ b/src/main/php/com/mongodb/io/Protocol.class.php
@@ -194,6 +194,13 @@ class Protocol {
foreach ($conn->server['hosts'] ?? [] as $host) {
if ($conn->server['primary'] !== $host) $this->nodes['secondary'][]= $host;
}
+
+\util\cmd\Console::writeLine(
+ '[MONGODB] Connected to ',
+ $candidate === $conn->server['primary'] ? 'primary::' : 'secondary::',
+ $candidate
+);
+
shuffle($this->nodes['secondary']);
} else if ($time - $conn->lastUsed >= $this->socketCheckInterval) {
try {
When switching connections, the following appear correctly in the console before issuing the write command:
[MONGODB] Connected to secondary::ac-vkfrdmv-shard-00-00.abc1234.mongodb.net:27017
[MONGODB] Connected to primary::ac-vkfrdmv-shard-00-02.abc1234.mongodb.net:27017
The first is from the initial connect, the second when a write context is established.
More testing doesn't show an immediate bug in the implementation and suggests that a primary re-election on the server side may lead to this situation:
Event | Client view | Server side situation |
---|---|---|
Client: Before | (n/a) | {Primary: A, Secondary: [B, C]} |
Client: Connect | {Primary: A, Secondary: [B, C]} | {Primary: A, Secondary: [B, C]} |
Remote: Primary change | {Primary: A, Secondary: [B, C]} | {Primary: B, Secondary: [A, C]} |
Client: Write | {Primary: A, Secondary: [B, C]} | {Primary: B, Secondary: [A, C]} |
The client has an inconsistent view of the server side situation and uses node A
for writing which it believes to be the primary, while in reality it's changed its role to a secondary.
An idea could be catch this specific error (any of the error codes categorized as "NotPrimaryError" here) and refresh the view into the cluster before retrying the write command.
Occurred when issuing a write command after a while of the process being idle. Seems reconnect handling does not work reliably.