My Availability Groups got Affected by Covid-19

Hi Friends,

By looking at the title you might find it a bit funny, but by the end I think you will agree it is apt. This will be a lengthy post, as I will show you how I performed a detailed investigation, so please be a little patient; I promise you will learn something new for sure.

Let me start with the issue. Early on Sunday morning I got a call at 6:30 stating that the availability groups were affected, but after 2 minutes I didn't receive any other call. It being Sunday, I was a little relaxed. About 30 minutes later I started getting calls about the same server; this time I didn't take it lightly and connected to the servers to see what the issue was.

When I opened Failover Cluster Manager I could see the below state for one of the nodes:








By looking at this I was surprised to see that one of my nodes (the primary replica) had gone into a Quarantine state. My first impression was that my AG got affected (of course, not by Covid-19). I read up on the concept of Windows nodes getting quarantined, which was introduced in Windows Server 2016. I ran the command  Start-ClusterNode –ClearQuarantine  and after a few seconds everything went back to normal.
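
Below is a minimal sketch of how the node state can be checked and the quarantine cleared from PowerShell; the node name is a placeholder, so adjust it for your environment.

# List all cluster nodes with their state and votes
# (a quarantined node is typically reported as Quarantined in newer builds)
Get-ClusterNode | Format-Table Name, Id, State, NodeWeight, DynamicWeight

# Clear the quarantine so the node is allowed to rejoin the cluster
# (run locally on the affected node, or pass -Name from another node)
Start-ClusterNode -Name "SQLNODE1" -ClearQuarantine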

So what is Quarantine? 

It is not desirable to have a node constantly falling in and out of cluster membership, which is likely caused by some underlying configuration or hardware issues. This “flapping” of membership can cause resources to move constantly between nodes, resulting in service interruption. Thus Windows Server 2016 introduces the Quarantine logic to stop this from happening. 

Quarantine is triggered for a node if the node ungracefully leaves the cluster three times within an hour. By default, that node will stay in the Quarantine state for 2 hours, during which time it is not allowed to rejoin the cluster or host cluster resources.
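
If you want to check or tune these limits on your own cluster, they are exposed as cluster common properties on Windows Server 2016 and above; a quick sketch (the defaults below are the documented ones, so verify them on your cluster):

# Quarantine settings of the cluster
Get-Cluster | Format-List QuarantineThreshold, QuarantineDuration

# QuarantineThreshold: ungraceful leaves per hour before a node is quarantined (default 3)
# QuarantineDuration : seconds the node stays quarantined (default 7200, i.e. 2 hours)

# Example only: extend the quarantine window to 3 hours (change with caution)
# (Get-Cluster).QuarantineDuration = 10800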

Not only humans; even cluster nodes are supposed to go into a Quarantine state.

Now let me share the environment details: we have 3 nodes in one data center, 3 in the other data center, and a file share witness since we have an even number of nodes.

Below is the log information from the SQL Server error log:

2020-09-27 09:23:29.410 spid29s      Failed to update Replica status within the local Windows Server Failover Clustering (WSFC) due to exception 41034.

2020-09-27 09:23:29.410 spid29s      Failed to update Replica status within the local Windows Server Failover Clustering (WSFC) due to exception 41005.

2020-09-27 09:23:29.900 Server       Always On Availability Groups: Local Windows Server Failover Clustering node is no longer online. This is an informational message only. No user action is required.

2020-09-27 09:23:29.910 spid19s      Always On: The availability replica manager is going offline because the local Windows Server Failover Clustering (WSFC) node has lost quorum. This is an informational message only. No user action is required.

2020-09-27 09:23:29.910 Server       The availability group 'XXXX' is being asked to stop the lease renewal because the availability group is going offline. This is an informational message only. No user action is required.

2020-09-27 09:23:29.910 spid19s      Always On: The local replica of availability group 'XXXX' is stopping. This is an informational message only. No user action is required.

2020-09-27 09:23:29.910 spid19s      The state of the local availability replica in availability group 'UAE-AG' has changed from 'PRIMARY_NORMAL' to 'RESOLVING_NORMAL'.  The state changed because the local instance of SQL Server is shutting down.

After some time (perhaps a few seconds) the replica came back online without my intervention. Post that, I could see the below information inside my logs.

2020-09-27 09:24:45.060 spid19s      Always On Availability Groups: Local Windows Server Failover Clustering node is online. This is an informational message only. No user action is required.

2020-09-27 09:24:45.970 Logon        Error: 983, Severity: 14, State: 1.

2020-09-27 09:24:45.970 Logon        Unable to access availability database 'XXXXXX' because the database replica is not in the PRIMARY or SECONDARY role. Connections to an availability database is permitted only when the database replica is in the PRIMARY or SECONDARY role. Try the

2020-09-27 09:24:47.050 spid52       The state of the local availability replica in availability group 'UAE-AG' has changed from 'RESOLVING_NORMAL' to 'PRIMARY_PENDING'.  The state changed because the availability group is coming online.  

2020-09-27 09:24:47.080 Server       The state of the local availability replica in availability group 'XXXXX' has changed from 'PRIMARY_PENDING' to 'PRIMARY_NORMAL'.

As you can see, the availability group went offline due to issues at the Windows clustering level, but it didn't fail over to any other replica. You can find the Always On failover history using the below link:

https://www.chiranjeevivamsydba.com/2020/01/always-on-failover-history.html

Let me show you the Windows cluster logs for the same time window. Interpreting cluster logs is not that simple, as it is not our comfort zone.
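
In case you are wondering how to generate these logs in the first place, below is a sketch using Get-ClusterLog; the destination folder and the time span are only examples.

# Dump the cluster log from every node for the last 2 hours into one folder.
# -UseLocalTime writes timestamps in local time, which makes it easier to line
# the entries up with the SQL Server error log.
Get-ClusterLog -Destination "C:\Temp\ClusterLogs" -TimeSpan 120 -UseLocalTime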






By looking at this we can say this is not the right voting mechanism. As this is a multi-subnet cluster, the best recommendation is not to assign votes to the nodes on the other site of the cluster; otherwise, any interruption on that site will impact the availability groups on the primary site.

As we have an even number of nodes (6), we do have a file share witness as a tie breaker.




As you can see from the above, in total we have 7 votes, so to keep quorum intact we need at least 4 of them.
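
If you prefer PowerShell over the GUI, the same voting picture can be pulled as below; the DR node name in the commented line is just a placeholder.

# NodeWeight is the configured vote; DynamicWeight is the vote the cluster
# is actually counting right now (dynamic quorum)
Get-ClusterNode | Format-Table Name, State, NodeWeight, DynamicWeight

# Quorum configuration, including the file share witness resource
Get-ClusterQuorum | Format-List Cluster, QuorumResource

# Example only: remove the vote from a DR-site node
# (Get-ClusterNode -Name "DRNODE1").NodeWeight = 0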

When you are interpreting cluster logs, they have their own language. My first task is to get the node names in the order the cluster thinks of them:

<NODE id='1' name='XXXXXXX2'/> 

<NODE id='2' name='XXXXXXX1'/>

<NODE id='3' name='XXXXXXX2'/>

<NODE id='4' name='XXXXXX01'/>

<NODE id='5' name='XXXXXXX3'/>

<NODE id='7' name='XXXXXXX3'/> 

As you can see, it considers the primary replica to be NODE id=1, it skipped number 6, and the remaining names go up to NODE id=7. With this in place we can easily break down the logs.

Also, if possible, get the IPv6 address of each node, as the cluster communicates using them:

NODE4->fe80::9471:eaf1:16b8:6f66(10.8.52.4)

NODE3->fe80::8c93:1035:3a4f:89b0(10.8.52.5)

NODE7->fe80::a12c:fd7f:95b3:ec7b(10.8.52.14)

NODE5->fe80::c83f:5276:425d:626f(10.7.52.14)

NODE2->fe80::a1bd:2adc:3102:5cf4(10.7.52.4)

NODE1->fe80::f9ff:6084:a429:4df8(10.7.52.5)
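
Instead of fishing these out of the log header, the same ID-to-name mapping and the addresses of each node can also be pulled with PowerShell; a quick sketch (double-check the output against your own cluster):

# Node IDs exactly as the cluster log refers to them
Get-ClusterNode | Sort-Object Id | Format-Table Id, Name, State

# Addresses of each cluster network interface, per node
Get-ClusterNetworkInterface | Format-Table Node, Network, Address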

000026c4.00000068::2020/09/27-09:23:00.742 INFO  [IM] got event: LocalEndpoint 10.7.52.5:~3343~ has missed two-fifth consecutive heartbeats from 10.8.52.4:~3343~

000026c4.00000068::2020/09/27-09:23:00.757 INFO  [IM] got event: LocalEndpoint 10.7.52.5:~3343~ has missed two-fifth consecutive heartbeats from 10.8.52.5:~3343~

000026c4.00000068::2020/09/27-09:23:00.773 INFO  [IM] got event: LocalEndpoint 10.7.52.5:~3343~ has missed two-fifth consecutive heartbeats from 10.8.52.14:~3343~

The cluster communicates with the other nodes on port 3343. In my case the primary replica (10.7.52.5) is not able to communicate with the cluster nodes on the other side (10.8.52.4, 10.8.52.5 and 10.8.52.14).
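
If you want to verify this kind of connectivity yourself, a quick check of port 3343 and of the cluster heartbeat thresholds could look like the below; the IP address is from my example, so substitute your own.

# TCP check of the cluster service port on a remote node
# (note: the heartbeats themselves go over UDP 3343, which this does not test)
Test-NetConnection -ComputerName 10.8.52.4 -Port 3343

# How often heartbeats are sent and how many misses are tolerated before a
# route is declared down; the CrossSubnet* values apply between the two sites
Get-Cluster | Format-List SameSubnetDelay, SameSubnetThreshold, CrossSubnetDelay, CrossSubnetThreshold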

000026c4.000023f4::2020/09/27-09:23:02.008 ERR   [NODE] Node 1: Connection to Node 4 is broken. Reason (10060)' because of 'channel to remote endpoint fe80::9471:eaf1:16b8:6f66%4:~3343~ has failed with status 10060'

000026c4.000030b8::2020/09/27-09:23:02.009 ERR   [NODE] Node 1: Connection to Node 3 is broken. Reason (10060)' because of 'channel to remote endpoint fe80::8c93:1035:3a4f:89b0%4:~3343~ has failed with status 10060'

000026c4.00003094::2020/09/27-09:23:02.040 ERR   [NODE] Node 1: Connection to Node 7 is broken. Reason (10060)' because of 'channel to remote endpoint fe80::a12c:fd7f:95b3:ec7b%4:~3343~ has failed with status 10060'

So it says the connection is broken due to missed heartbeats. It then tries to re-establish the connection over other dynamic ports, but it was not able to.

000026c4.00001774::2020/09/27-09:23:09.847 INFO  [SV] New real route: local (10.7.52.5:~3343~) to remote  (10.8.52.4:~57315~).

000026c4.00001774::2020/09/27-09:23:09.847 INFO  [SV] Got a new incoming stream from 10.8.52.4:~57315~

000026c4.00001774::2020/09/27-09:23:09.597 INFO  [SV] New real route: local (10.7.52.5:~3343~) to remote  (10.8.52.5:~53800~).

000026c4.00001774::2020/09/27-09:23:09.597 INFO  [SV] Got a new incoming stream from 10.8.52.5:~53800~

000026c4.00001874::2020/09/27-09:23:09.566 INFO  [SV] New real route: local (10.7.52.5:~3343~) to remote  (10.8.52.14:~63184~).

000026c4.00001874::2020/09/27-09:23:09.566 INFO  [SV] Got a new incoming stream from 10.8.52.14:~63184~

So you can see that it tried to establish the communication, but eventually it failed and came up with the below messages:

000026c4.00000068::2020/09/27-09:23:12.732 WARN  [NDP] All routes for route (virtual) local fe80::f9ff:6084:a429:4df8:~0~ to remote fe80::9471:eaf1:16b8:6f66:~0~ are down

000026c4.00000068::2020/09/27-09:23:12.748 WARN  [NDP] All routes for route (virtual) local fe80::f9ff:6084:a429:4df8:~0~ to remote fe80::8c93:1035:3a4f:89b0:~0~ are down

000026c4.00000068::2020/09/27-09:23:12.764 WARN  [NDP] All routes for route (virtual) local fe80::f9ff:6084:a429:4df8:~0~ to remote fe80::a12c:fd7f:95b3:ec7b:~0~ are down

So it tried its best, but after trying all the routes it finally says there is no route left, as all the routes got exhausted, which means one side of the cluster is not able to communicate with the other side.

000026c4.000011dc::2020/09/27-09:23:18.019 INFO  [CORE] Node 1: Proposed View is <ViewChanged joiners=() downers=(3 4 7) inExtendedGrace=() newView=22505(1 2 5) oldView=22302(1 2 3 4 5 7) joiner=false form=false/>

After that it re-arranges itself with the downers and joiners information. As you can see, in my case the other side of the cluster (3, 4, 7) is down and only (1, 2, 5) were left. Now, as I told you earlier, we do have a file share witness in place, which should help us in these kinds of scenarios. Remember, since this was an ungraceful exit, we will not have the concept of last man standing.

Now this is the most critical part:

00001644.00001fc4::2020/09/27-09:23:28.331 INFO  [RES] File Share Witness <File Share Witness>: Releasing temporary lock on witness file share.

00000c20.00002518::2020/09/27-09:23:28.332 INFO  [QUORUM] Node 5 CompareAndSetWitnessTag: 

Attempting to update witness tag (2020/04/03-08`23`05.525_146:2020/04/03-08`23`05.525_146:34575) to 2020/04/03-08`23`05.525_145:2020/04/03-08`23`05.525_145:34576.

00000c20.00002518::2020/09/27-09:23:28.332 WARN  [QUORUM] Node 5 CompareAndSetWitnessTag:

witness tag (2020/04/03-08`23`05.525_146:2020/04/03-08`23`05.525_146:34575) is better than proposed tag (2020/04/03-08`23`05.525_145:2020/04/03-08`23`05.525_145:34576).

00000c20.00002518::2020/09/27-09:23:28.332 INFO  [QUORUM] Node 5 CompareAndSetWitnessTag: releasing witness share lock

00001644.00001fc4::2020/09/27-09:23:28.332 INFO  [RES] File Share Witness <File Share Witness>: Ignoring request to release witness share because it is not currently locked.

My quorum should have survived on the primary site; however, as per the above logs, Node 5 failed to obtain the lock on the witness because a node on the other site had already locked and updated it, and that resulted in the loss of quorum.
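
If you want to see which node currently owns the witness resource and which share it points to, something like the below should do; the resource name 'File Share Witness' is taken from my logs and yours may differ.

# State and current owner of the witness resource
Get-ClusterResource "File Share Witness" | Format-Table Name, State, OwnerNode

# The share the witness points to (SharePath is a private parameter of the resource)
Get-ClusterResource "File Share Witness" | Get-ClusterParameter SharePath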





















As you can see from the above screenshot, my file share witness doesn't have the latest copy of the cluster information (unlike a disk witness, a file share witness does not keep a copy of the cluster database). If this had been a disk witness, we could definitely have avoided the issue.
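
If you ever decide to move away from a file share witness, the witness can be reconfigured with Set-ClusterQuorum; the disk name and the storage account details below are placeholders only, and a disk witness of course needs shared storage that every node can see.

# Current quorum configuration for reference
Get-ClusterQuorum

# Switch to a disk witness (requires shared storage visible to all nodes)
# Set-ClusterQuorum -DiskWitness "Cluster Disk 1"

# Or, on Windows Server 2016 and above, a cloud witness in an Azure storage account
# Set-ClusterQuorum -CloudWitness -AccountName "mystorageaccount" -AccessKey "<storage-account-key>"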

00000c20.000014f4::2020/09/27-09:23:28.332 INFO  [QUORUM] Node 5: death timer is already running, 79 seconds left.

00000c20.00002518::2020/09/27-09:23:28.332 WARN  Arbitration failed with error code 995, not emitting QuorumArbitratred message

00000c20.00002518::2020/09/27-09:23:28.332 ERR   Shutting down because arbitration failed (status = 5925)

000007d8.00002f28::2020/09/27-09:23:28.332 WARN  [RHS] Cluster service has terminated. Cluster.Service.Running.Event got signaled.

000026c4.00000718::2020/09/27-09:23:28.366 ERR   [NODE] Node 1: Connection to Node 5 is broken. Reason (10054)' because of 'channel to remote endpoint fe80::c83f:5276:425d:626f%4:~3343~ has failed with status 10054'

000026c4.000011dc::2020/09/27-09:23:29.391 ERR   [QUORUM] Node 1: Lost quorum (1 2)

000026c4.000011dc::2020/09/27-09:23:29.391 ERR   [QUORUM] Node 1: goingAway: 0, core.IsServiceShutdown: 0

Now, when it comes to voting, we had already lost 3 votes due to missed heartbeats from the other site; we then lost the file share witness and one of the nodes on the same site, which left us with just 2 votes, well short of the 4 needed for quorum.

We can see from the above messages that the lock on the file share witness was obtained by the other site; as a result we lost our quorum, and thereafter, to avoid a split brain, one of the nodes terminated its cluster service.
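
You can confirm the same story from the Windows System event log as well; event IDs 1135 (node removed from active membership) and 1177 (quorum lost) are the ones to look for.

# Pull the membership-loss and quorum-loss events from the System log
Get-WinEvent -FilterHashtable @{
    LogName      = 'System'
    ProviderName = 'Microsoft-Windows-FailoverClustering'
    Id           = 1135, 1177
} | Select-Object TimeCreated, Id, Message | Format-List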

Soon after, it recovered on its own. After 30 minutes I again got the below messages inside the SQL logs:

2020-09-27 09:55:44.940 spid26s      Failed to update Replica status within the local Windows Server Failover Clustering (WSFC) due to exception 41034.

2020-09-27 10:01:22.340 spid208s     Failed to update Replica status within the local Windows Server Failover Clustering (WSFC) due to exception 41034.

As I mentioned above, if a node ungracefully leaves the cluster 3 times within an hour, which in my case happened at 9:23, 9:55 and 10:01, the node goes into the Quarantine state.

Now I hope you will agree with the title. Also, the solution I ran, i.e. Start-ClusterNode –ClearQuarantine, might not work in all cases, hence I would suggest you try it only as a last resort.

In case you have any queries or comments, do reach out to me on +353894316824/+919739093905 (WhatsApp).
