Adding a recovery mechanism for a split gossip cluster #2134
fcrisciani merged 1 commit into moby:master
Conversation
Codecov Report
@@            Coverage Diff            @@
##             master    #2134   +/-  ##
=========================================
  Coverage          ?   40.49%
=========================================
  Files             ?      139
  Lines             ?    22467
  Branches          ?        0
=========================================
  Hits              ?     9097
  Misses            ?    12031
  Partials          ?     1339
Continue to review full report at Codecov.
networkdb/delegate.go
Outdated
  func (d *delegate) LocalState(join bool) []byte {
  	if join {
  		// Update all the local node/network state to a new time to
  		// Update all the local node/network state to a new time to
How this happened... Fat finger :)
I'll fix it
networkdb/cluster.go
Outdated
  if _, err := mlist.Join(members); err != nil {
  	// In case of failure, keep retrying join until it succeeds or the cluster is shutdown.
- 	go nDB.retryJoin(members, nDB.stopCh)
+ 	go nDB.retryJoin(nDB.ctx, members)
With this change I don't think launching this routine makes sense anymore. If there is a failure here, the next attempt will happen in 60s or less through the other code path.
Good point. clusterJoin is only called when the agent is initialized to join the boot_strap nodes; unless it's time critical and we can't wait 60 seconds, it's now redundant.
Happy to take it out.
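For context, a minimal sketch of what the context-driven retryJoin being discussed can look like (retryInterval and the nDB.memberlist handle are assumed names; this is not necessarily the exact code in the PR):

  // Sketch only: retry the memberlist join until it succeeds or the
  // context is cancelled, checking once per retryInterval.
  func (nDB *NetworkDB) retryJoin(ctx context.Context, members []string) {
  	t := time.NewTicker(retryInterval)
  	defer t.Stop()

  	for {
  		select {
  		case <-t.C:
  			if _, err := nDB.memberlist.Join(members); err != nil {
  				logrus.Errorf("Failed to join memberlist %s on retry: %v", members, err)
  				continue
  			}
  			return
  		case <-ctx.Done():
  			return
  		}
  	}
  }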
networkdb/cluster.go
Outdated
  ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
  defer cancel()
  nDB.retryJoin(ctx, bootStrapIPs)
networkdb/cluster.go
Outdated
  nDB.RUnlock()
  logrus.Debugf("rejoinClusterBootStrap, calling cluster join with bootStrap %v", bootStrapIPs)
  // All bootStrap nodes are not in the cluster, call memberlist join
  ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
we should define this 10s timeout as a constant at the top
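For illustration, hoisting the timeout could look like this (the constant names are hypothetical, not quoted from the merged change):

  // Hypothetical constants near the top of cluster.go so the timing
  // knobs live in one place.
  const (
  	rejoinClusterDuration = 10 * time.Second // bound on each re-join attempt
  	rejoinInterval        = 60 * time.Second // how often the bootstrap check runs
  )

  // The call site above then becomes:
  ctx, cancel := context.WithTimeout(nDB.ctx, rejoinClusterDuration)
  defer cancel()
  nDB.retryJoin(ctx, bootStrapIPs)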
  }
  // If the node is not known from memberlist we cannot save any state for it, else if it actually
  // dies we won't receive any notification and we will remain stuck with it
  if _, ok := nDB.nodes[nEvent.NodeName]; !ok {
Is the nDB.findNode check necessary above [L28] given this check?
Only to filter based on the Lamport clock.
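To make the interplay of the two checks concrete, a sketch of the shape being discussed (assembled from the diff hunks in this thread; the surrounding handleNodeEvent code and findNode's signature are assumptions):

  // First gate: networkDB's own table, used to compare Lamport times so
  // stale or replayed events are dropped.
  n, _, _ := nDB.findNode(nEvent.NodeName)
  if n != nil && n.ltime >= nEvent.LTime {
  	return false // event is not newer than what we already know
  }

  // Second gate: memberlist's view. If memberlist doesn't know the node,
  // storing state for it risks never hearing about its death.
  if _, ok := nDB.nodes[nEvent.NodeName]; !ok {
  	return false
  }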
  // rejoinClusterBootStrap is called periodically to check if all bootStrap nodes are active in the cluster,
  // if not, call the cluster join to merge 2 separate clusters that are formed when all managers
  // stopped/started at the same time
  func (nDB *NetworkDB) rejoinClusterBootStrap() {
Would it make sense to also attempt to refresh nDB.bootStrapIP here, through a call to something like GetRemoteAddressList, in case the list has changed?
@fcrisciani and I discussed other use cases that can lead to the same "split cluster" and the possibility of re-checking/updating the bootStrap IPs (through GetRemoteAddressList):
1- Re-IP the managers.
2- Demote/promote managers/workers.
The first one is not an issue, as a re-IP of all managers will force all nodes in the cluster to restart.
As for 2), customers will only hit it if they demote all managers in the cluster and restart those managers without restarting the workers... We felt this is a very edge case.
That being said, if you think we should still refresh the bootStrapIP, we can add the logic to the newly introduced rejoinClusterBootStrap.
I think the 2 problems can be handled separately. Memberlist, with this PR, will honor the bootstrapIP that got passed at the beginning.
The second issue will need a periodic check of GetRemoteAddressList, and if the list changes, the routine has to call networkDB.Join with the new bootstrapIPs.
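Putting the thread together, a hedged sketch of the periodic check being added (field names are taken from the diff hunks in this review; details may differ from the final code):

  // rejoinClusterBootStrap: if none of the bootStrap nodes are currently
  // members, attempt a bounded memberlist re-join to heal a split.
  func (nDB *NetworkDB) rejoinClusterBootStrap() {
  	nDB.RLock()
  	if len(nDB.bootStrapIP) == 0 {
  		nDB.RUnlock()
  		return // no bootstrap nodes configured
  	}

  	bootStrapIPs := make([]string, 0, len(nDB.bootStrapIP))
  	for _, bootIP := range nDB.bootStrapIP {
  		for _, node := range nDB.nodes {
  			if node.Addr.Equal(bootIP) {
  				// At least one bootstrap node is in the cluster;
  				// nothing to heal.
  				nDB.RUnlock()
  				return
  			}
  		}
  		bootStrapIPs = append(bootStrapIPs, bootIP.String())
  	}
  	nDB.RUnlock()

  	logrus.Debugf("rejoinClusterBootStrap, calling cluster join with bootStrap %v", bootStrapIPs)
  	// None of the bootstrap nodes are in the cluster: re-join, bounded
  	// to 10 seconds as discussed above.
  	ctx, cancel := context.WithTimeout(nDB.ctx, 10*time.Second)
  	defer cancel()
  	nDB.retryJoin(ctx, bootStrapIPs)
  }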
Force-pushed from 0a10a03 to e0dd4ed
networkdb/cluster.go
Outdated
  		bootStrapIPs = append(bootStrapIPs, bootIP.String())
  	}
  	nDB.RUnlock()
  	// Not all bootStrap nodes are in the cluster, call memberlist join
I think you mean "None of the bootStrap nodes are in the cluster" rather than "Not all ..." in the comment, right?
yep. I will reword it and push the update.
Thx
LGTM, if you fix the comment that @ddebroy mentioned we can merge.
Signed-off-by: Dani Louca <dani.louca@docker.com>
Thx guys. PR is updated with the latest change request.
  bootStrapIPs := make([]string, 0, len(nDB.bootStrapIP))
  for _, bootIP := range nDB.bootStrapIP {
  	for _, node := range nDB.nodes {
  		if node.Addr.Equal(bootIP) {
@dani-docker thinking more about this, I guess a check is missing here that the IP is different from the current node's IP, else this fix won't work for the managers: every manager will see itself in the list and won't try to reconnect.
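A possible shape for that guard, folded into the loop above (a sketch; looking up our own entry via nDB.config.NodeID is an assumption on my part):

  myself, ok := nDB.nodes[nDB.config.NodeID]
  if !ok {
  	nDB.RUnlock()
  	return // we are not even in memberlist yet
  }
  for _, bootIP := range nDB.bootStrapIP {
  	// Skip our own address: a manager always "sees" itself, which must
  	// not count as evidence that the bootstrap set is reachable.
  	if bootIP.Equal(myself.Addr) {
  		continue
  	}
  	for _, node := range nDB.nodes {
  		if node.Addr.Equal(bootIP) {
  			nDB.RUnlock()
  			return
  		}
  	}
  	bootStrapIPs = append(bootStrapIPs, bootIP.String())
  }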
Steps To Reproduce:
Expected Results:
Actual Results:
… overlay and unpredicted networking issues between containers on the 2 different clusters.
Workaround:
… or … the /join endpoint
PR fix:
This PR adds a go routine that runs every minute and checks that at least one node from the bootStrap list (i.e. the manager nodes) is part of the cluster; if we can't find any, we attempt a /join for 10 seconds.
Note:
While testing, I ran into an issue that I couldn't reliably reproduce: a leave event was followed by a false positive join event received by a worker while the manager node was down. This caused an inconsistency in nDB.nodes, which caused the logic in this PR to fail. @fcrisciani recommended the below change and he'll be looking into the problem in more detail.
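For reference, a minimal sketch of the once-a-minute driver described above (the goroutine wiring and the hard-coded interval are assumed, not quoted from the PR):

  // Runs for the life of the NetworkDB instance: fire the
  // bootstrap-membership check every minute until the context is cancelled.
  go func() {
  	ticker := time.NewTicker(60 * time.Second)
  	defer ticker.Stop()
  	for {
  		select {
  		case <-nDB.ctx.Done():
  			return
  		case <-ticker.C:
  			nDB.rejoinClusterBootStrap()
  		}
  	}
  }()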
Signed-off-by: Dani Louca <dani.louca@docker.com>