Are Our Secondary Replicas Really Synchronous?

 Hi Friends,

This is very simple post and explores the new setting of what we have on SQL Server 2017. I would like to thank David Fowler as it is inspiration from his blog

Okay, let us begin by understanding what Synchronous really means?

 Hi Friends,

This is very simple post and explore the new setting of what we have on SQL Server 2017.

Okay, let us begin by understanding what Synchronous really means?

Normally, if we have got a secondary replica in synchronous commit mode, our primary cannot commit any transactions until that transaction has been hardened to disk on all the synchronous secondaries.

When you commit a transaction on the primary, SQL Server will send all changes over to its secondary nodes. It will then wait for confirmation from the synchronous nodes that the changes have been hardened before it can commit locally.

This ensures that those synchronous secondaries are always up to date and will mean that in the event of a fail over, there is no risk of data loss. Any delay in the secondary writing down those changes will mean a delay on committing on the primary (you will see this in the form of HADR_SYNC_COMMIT waits on the primary).

The primary replica is constantly monitoring the state of its secondaries. With the use of continuous ping, the primary node always knows if the secondaries are up or down.

It is when SQL detects that one of its synchronous replicas goes offline is when interesting things can happen.

So, here is the discussion that came up, if a synchronous replica goes offline for whatever reason, SQL won’t be able to commit any transactions and that means we can be confident that the secondary is up to date, right?

Well, not quite. Let us try running the following code to simply update a table with a timestamp every second.

We have 3 replicas running on Microsoft SQL Server 2017 (RTM-CU18) and all of them are in synchronous Mode.

I started testing

create table synctest(sname datetime)

TRUNCATE TABLE SyncTest

WHILE 1=1

BEGIN

INSERT INTO synctest VALUES (GETDATE())

WAITFOR DELAY '00:00:01'

END

select * from synctest

After a short while I Stopped SQL Services on one of the secondary replicas and here are the results












There were no records for 10 seconds

Interesting, we can see that there was a delay when the synchronous secondary went offline as we would expect. But then after a short while, it looks like SQL Server decided to continue committing transactions. Is that right?

Session-Timeout

And here is where the misinterpretation came in. Availability groups have a setting, session-timeout.  This timeout controls how long a primary replica will wait to receive a ping from a secondary (by default, this is 10 seconds). Once that timeout period is exceeded, the primary will no longer wait for that replica when committing a transaction, effectively switching it to asynchronous commit.

This is the reason why, after 10 seconds of the secondary going down, we start to see transactions being committed again.

So, My Question is how will we ensure that we can avoid this situation? 

SQL Server 2017 introduced REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT cluster resource setting.

This setting guarantees the specified number of secondary replicas write the transaction data to log before the primary replica commits each transaction






The default setting is Zero and I changed to one and carried out the same exercise as above and still the results are same where we got 10 seconds lag and the transactions continued post that.

Finally, after changing this setting to 2 as expected the transactions didn't run as the primary can't continue because the transactions can't get hardened on to secondary. Below is the error message

(1 row affected)

Remote harden of transaction 'INSERT' (ID 0x0000000001677d4d 0000:0000081c) started at Sep 14 2020 2:03PM in database 'india' at LSN (36:2213:3) failed.

(1 row affected)

Msg 596, Level 21, State 1, Line 3

Cannot continue the execution because the session is in the kill state.

Msg 0, Level 20, State 0, Line 3

A severe error occurred on the current command.  The results, if any, should be discarded.

So Now think about this setting and implement it as per your needs. Below is the link which explains about the setting

REQUIRED_SYNCHRONIZED_SECONDARIES_TO_COMMIT

 

Comments