Hi Mates,
Today I want to share my Analysis with respect to connection Time out Errors in ALWAYS ON.
In my previous post I mentioned about how to go through Windows Cluster Logs
https://www.chiranjeevivamsydba.com/2020/10/my-availability-groups-got-affected-by.html
Always ON is such a topic where we need to examine different logs and need to analyze them properly before coming to any conclusions.
Okay, Let me go through the issue. I got an incident from Application Team indicating they experienced some issues yesterday. When I enquired about the time they said it is at early Morning 3 AM.
At first, I checked if there were any Maintenance jobs running at that time as usually these are times when they run. However there were nothing running at that time.
My replicas are in a Multi-Subnet with 3 nodes in Each Data Center
When checked the Error Logs I could see below messages inside them and all of them were with 3 replicas in another DC.
2020-10-23 03:02:56.780 spid207s A connection timeout has occurred on a previously established connection to availability replica 'XXXXXXX'
2020-10-23 03:02:56.870 spid295s A connection timeout has occurred on a previously established connection to availability replica 'XXXXXXX'
2020-10-23 03:02:56.910 spid295s A connection timeout has occurred on a previously established connection to availability replica 'XXXXXX'
2020-10-23 03:02:56.780 spid207s Always On Availability Groups connection with secondary database terminated for primary database 'ReportServerTempDB' on the availability replica 'XXXXXX' with Replica ID: {12769114-d890-4e2d-b4fa-7c346244be1e}.
2020-10-23 03:02:56.780 spid299s Always On Availability Groups connection with secondary database terminated for primary database 'ReportServer144' on the availability replica 'XXXX' with Replica ID: {12769114-d890-4e2d-b4fa-7c346244be1e}.
So Now how will we troubleshoot this.
1) Error Logs: Those are the Messages what I can see
2) Always ON Health Sessions(Extended Events): They too have the same messages of what I could see inside Error Logs.
3) SQL Diagnostic Extended Event files: These are available only on Primary Replica and I use them a lot for Investigation Purposes.
There are 2 things I would like to discuss here firstly at the highlighted part under component state we see it as Warning. Also it says we have something related to Blocking. I got an impression that Blocking was the culprit however the database is not on ReportServer or ReportServerTEMPDB which are the ones that got affected.
4) System Health Session (Extended Events): This is kind of redundant to that of SQL Diagnostic in this case even it says the same information to that of above.
5) Windows Cluster Logs: We didn't have any Failover hence there is not much here apart from the below message. (I understood a lot what are the messages that gets logged for Connection time out Errors but I can't post here as it deviates from the topic)
00001d1c.00002af8::2020/10/23-03:02:56.370 INFO [RES] SQL Server Availability Group: [hadrag] SQL Server component 'query_processing' health state has been changed from 'clean' to 'warning' at 2020-10-23 03:02:56.360
Frankly Speaking none of the logs helped me to understand what is the real contributor for these Connection time outs. Also there were no snapshots(VSS) or Antivirus related stuff running at that time so we can rule out them too.
I then thought of reproducing the Errors on my Test machine and I did that using Networking Tools.
As you can see I simulated the stuff and at the same time I could see the message
A connection timeout has occurred while attempting to establish a connection to availability replica 'XXXXXX' with id [B4DF270A-D29D-451A-B2BD-4727323D7234]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.
So How will I troubleshoot this? My Answer is try to run a light weight ping test from the source (primary replica) to Secondary servers and see if you are seeing any time outs
Below is the command what I made use of
ping.exe -t 192.168.58.68|Foreach{"{0} - {1}" -f (Get-Date),$_} > D:\backups\192.168.58.68.txt
As you can see from the above image I got the output in to Notepad file. I would only recommend this if you are seeing connection time out messages quite frequently.
Hope you have Learned Something with this.
Comments
We also encounter these timeouts, and can't figure out what's the cause and how to eliminate these errors.