Category Archives: Database

The Transaction Log, Delayed Durability, and considerations for its use with In-Memory OLTP – Part I

warhol_smaller

If Warhol did transaction Logging…

In this mini-series of posts, we will discuss how the mechanics of SQL Server’s transaction logging work to provide transactional durability. We will look at how Delayed Durability changes the logging landscape and then we will specifically see how In-Memory OLTP logging builds upon the on-disk logging mechanism. Finally, we will pose the question “should we use Delayed Durability with In-Memory or not” and discuss this scenario in detail. But in order to understand how delayed durability works, it is first important for us to understand how the transaction log works -so we shall start there…

Caching caching everywhere nor any drop to drink!

SQL Server is a highly efficient transaction processing platform and nearly every single operation performed by it, is usually first done so within memory. By making changes first in memory, the need to touch physical resources (such as physical disk IOPs) are reduced (in some cases substantially), and reducing the need to touch physical resources means those physical boundaries (and their limitations) have less impact to the overall system performance. Cool right?!

A little bit about the Buffer Cache

All data access such as insertion, deletion or updates are all first made in-memory, and if you are a Data Administrator, you will (or should!) already be familiar with the Buffer Cache and how it works. When data pages are required to fulfill data access requests (and assuming they are not already present in the Buffer Cache), the Storage Engine first reads those data pages into the Buffer Cache before performing any requested operations on those pages in memory. If the pages in memory are modified, those dirty pages will ultimately need to be persisted directly to disk (i.e. to their respective physical data file/s) through automatic CHECKPOINTing and will contain changes performed by either committed or uncommitted transactions. In the latter case of uncommitted transactions, the presence of those dirty pages in the data file/s is why UNDO (found in the transaction log) is performed upon recovery to roll back those changes into the buffer cache. Conversely, any dirty (buffer cache) pages that are a result of committed transaction changes but have not yet been hardened to disk (through CHECKPOINT) would require the REDO portion of the transaction log to roll those changes forward on recovery. I will refer back to this again when we move on to talk about In-Memory OLTP.

It’s all about the log, about the log, no trouble!

But It is not the hardening of data to data files that we should really be focusing on here (since there is little relevance to delayed durability), we are more concerned about transactional durability in SQL Server.

There is a common misunderstanding with SQL Server DBAs that when transactions are making changes, the transactional changes are written immediately to the logfile. The durability claims of SQL Server make this misunderstanding very easy to understand, but requiring a physical IO operation to occur upon every single transactional data modification would clearly be a huge potential bottleneck (and dependency) upon the transaction log disk performance. The SQL Server architects recognised this challenge, so in a similar way that data page modifications are first cached, so are the transactional changes. In the case of the transaction log, the “cached” area is provided by in-memory constructs known as log buffers that can store up to 60 Kilobytes of transaction log records.

Now that you know that log buffers are there to delay (or buffer) transaction logging to maximize the efficiency of a single physical IOP to disk, we must now consider when these structures must absolutely be flushed to physical disk and stored within the transaction log in order to still provide the Durability property of ACID that SQL Server adheres to. There are two main situations a log buffer is flushed:

  1. On transactional COMMIT. When a client application receives control back after issuing a successful COMMIT statement, SQL Server *must* provide durability guarantees about the changes that have occurred within the transaction. Therefore all transactional operations must be written and persisted to the transaction log, ultimately requiring that the log buffer containing those changes must be flushed on COMMIT to provide this guarantee.
  2. On log buffer full. Strictly speaking, as long as the first rule is adhered to, there is no logical requirement that SQL Server should flush the log buffer to disk when it becomes full. However, if you consider that it is paramount for SQL Server to never run out of available log buffers, then it is pretty obvious that the best way for it to avoid this situation is to hold onto the log buffer only as long as it needs to. If a log buffer is full, then it serves no further purpose in the buffering story and furthermore, given that the log buffer will contain a maximum of 60 Kilobytes of changes, there is good alignment to the physical write to the transaction log. When the log buffer is full it makes sense to flush it to disk.

tranlog_smaller
Now we have a better understanding how the transaction log provides transactional durability to SQL Server and how it delays physical resource writes to the transaction log disk to improve logging performance, we can look at how Delayed Durability changes the logging landscape, why we might use it and what we are risking by doing so in the second post in this series.

Problem removing files from TempDB

I recently ran into an interesting problem while attempting to remove approximately half of the TempDB data files configured in a testing environment. As you might expect, there are various SQL Server tasks that are performed infrequently by a DBA, and this is a good example of one of them. Most environments have usually been misconfigured with too few TempDB data files or wrongly sized usually resulting in the classic allocation page contention problems (which is explained by this excellent SQLSkills article “The Accidental DBA (Day 27 of 30): Troubleshooting: Tempdb Contention“), but this particular environment had been temporarily over provisioned with disks and TempDB data files (for testing purposes).

Bonus fact: SQL Server 2016 attempts to automatically address the tempdb datafile allocation page contention problem by defaulting to 8 TempDB data files (or less if the number of cores is smaller). This can be overridden on install through the GUI or by using the /SQLTEMPDBFILECOUNT switch if performing a command line installation. Further optimisations have been implemented such as the adoption of uniform extent allocations by default and auto-growing all files simultaneously -both behaviours would formally have required turning on Traceflags 1117 and 1118. Several other improvements have been made to the performance of TempDB database which you can read about yourself in this (currently preview) documentation.

 

So the plan was to remove each TempDB file one by one, restart the instance (if required) and decommission those spare disks so that they can be returned and reallocated elsewhere in the environment. In order to remove each data file we would:

  1. Empty each file at a time (using DBCC SHRINKFILE).
  2. Remove each file upon empty.

Now there was a time (prior to SQL Server 2005) when we were told to tread very very carefully when shrinking TempDB files, and doing so could result in corruption. This perception remains with some of the longer serving SQL Server professionals (I question myself frequently) but if we take a look at KB307487 article “How to shrink the tempdb database in SQL Server” we can see that it is now safe to do so -although (as the Knowledge Base article states) certain scenarios can cause the shrink operation to fail.

So I ran the following code for each TempDB data file:

DBCC SHRINKFILE (tempdb_15, EMPTYFILE)
go
ALTER DATABASE [tempdb] REMOVE FILE [tempdb_15]
GO

The code succeeded for the first few files, right up to TempDB data file 12, where I hit the following error:

DBCC SHRINKFILE: Page 14:56 could not be moved
because it is a work table page.
Msg 2555, Level 16, State 1, Line 1
Cannot move all contents of file "tempdb_12" to
other places to complete the emptyfile operation.
DBCC execution completed. If DBCC printed error
messages, contact your system administrator.
Msg 5042, Level 16, State 1, Line 1
The file 'tempdb_12' cannot be removed
because it is not empty.

As odd as this error was (especially since the SQL instance was not currently “in use”) I decided to bounce it but after restarting the instance was again greeted with the same error message! After some very quick Google-Fu I came across an old MSDN Forum question “DBCC SHRINKFILE: Page 4:11283400 could not be moved because it is a work table page.” I ruled out some of the responses but came across the response by Mike Rose:

“I realize this is an old thread but I have found that in most cases work tables are related to Query Plans.

Try issuing the following commands and the shrinking the tempdb:

DBCC FREESYSTEMCACHE (‘ALL’)

DBCC FREEPROCCACHE

there will be some performance hit for this as SQL will have to recreate its Query Plans, but it should allow you to shrink you TEMPDB.

Hope this helps,

M Rose”

 
Now my initial thoughts were that this response was clearly nonsense since I had just restarted the SQL Service for the instance (thereby flushing the SQL caches), but hey what the hell, why not give it a whirl….

…and success! It worked.

Apart from the flushing of the caches, the outcome was even more strange to me for another reason. Upon reboot TempDB is “recreated” upon startup*1 and I would have expected that fact alone to have fixed the problem but from the behaviour I had experienced, something had persisted across instance restart.
*1 you may want to read this interesting post by Jonathan Kehayias “Does tempdb Get Recreated From model at Startup?

Also of interest to me was whether running FREEPROCCACHE and FREESYSTEMCACHE was overkill so when the opportunity arose I attempted first to try only clearing the systemcache specific to TempDB through:

DBCC FREESYSTEMCACHE ('tempdb')

…and then tried clearing temporary tables and table variables through:

DBCC FREESYSTEMCACHE ('Temporary Tables & Table Variables')

…and then by using both together. Sadly these did not seem to work.

On my final opportunity I tried clearing the procedure cache (DBCC FREEPROCCACHE) and after running this a few times it appeared to solve the problem and I was able to remove the TempDB file without error.

 


Clearing system caches on a Production SQL Server instance is never a good thing and should be avoided, but at least the next time I run into this problem I have (what appears) to be a work around to the problem I ran into and at least the requirement to remove TempDB data files should be a rare occurrence for most of us!

I confess that I am left with a few unanswered questions and clearly what I experienced does make me doubt the results somewhat (and my sanity) and there were further tests that I would like to perform at some point. So if you run into this problem yourself I encourage you to try this (disclaimer: your fault if you destroy your SQL Server 😉 ) and I would be delighted if you leave feedback of what works for you (and what doesn’t)!

Incrementing SQL Sequences

Sequences were first introduced in SQL Server 2012 and are a way to improve performance in situations that would have traditionally been implemented using the column based identity property. One of the biggest downfalls of the identity property is that new values are generated on inserts meaning that they are also transactionally dependent. Sequences on the other hand are not and allow for caching and many other benefits which you can compare and contrast in this really good SQL Server Pro magazine article by Itzik titled SQL Server 2012 T-SQL at a Glance – Sequences.

Recently our performance environment was undergoing load testing using pre-loaded synthetic data which (upon execution) started to result in failures from identity conflicts. It was fairly obvious that our current sequence seed values were much lower than the loaded data identities. Not so much of a problem you might think, since we can easily reseed the sequence number via the Sequence Properties dialog (below). Simply select the Restart sequence checkbox and type your new seed number into the entry box next to it and click OK.

sequence properties

The only problem with this approach is that our database was configured (rightly or wrongly) with approximately 250 sequences! Since we could not be sure which sequences would ultimately cause us problems we decided to increment each one by 10,000.

Not being someone who likes performing monotonous tasks and also recognising the fact that this task would probably need to be performed again in the future I decided to attempt to programmatically solve this problem.


How to use it

Disclaimer: Before using the following script, please make sure you understand what you are doing and where you are doing it. The responsibility for its use and misuse is all yours!

 
The script below is fairly basic and generates a script to update every single sequence within your database (make sure you change context to the correct one) with a default increment of 10000 (feel free to alter as necessary). If you only want to update a range of sequences then obviously you should add a WHERE clause to this script and filter on whatever criteria floats your boat.

DECLARE @increment_sequence INT = 10000
SELECT 
   'ALTER SEQUENCE [' + SCHEMA_NAME(seq.schema_id) 
   + '].[' + seq.name + ']'
   + ' RESTART WITH '+ CAST(CAST(seq.start_value AS INT) 
   + @increment_sequence as VARCHAR(max)) + ';'
FROM
   sys.sequences AS seq
   LEFT OUTER JOIN sys.database_principals AS sseq
ON
   sseq.principal_id = ISNULL(seq.principal_id, 
   (OBJECTPROPERTY(seq.object_id, 'OwnerId')))
ORDER BY
   SCHEMA_NAME(seq.schema_id) ASC,
   seq.[Name] ASC

It creates a script in your query results as below. Simply copy and paste this into a new query window and execute.

ALTER SEQUENCE [DB_Sequence].[LoanID] RESTART WITH 10033;
ALTER SEQUENCE [DB_Sequence].[TransferID] RESTART WITH 10000;
ALTER SEQUENCE [DB_Sequence].[AccountID] RESTART WITH 68719;
ALTER SEQUENCE [DB_Sequence].[CustomerID] RESTART WITH 1010006;

If you do need to update many sequences in your database I hope you find this script useful and it saves you as much time as much as it has me!

Restoring CDC enabled databases

surprisedPerforming database and transaction log backups (and restores) are the meat and veg of a DBA’s responsibilities and if you are a seasoned professional I am sure you will have performed those operations ad-infinitum. Being complacent is perhaps the biggest mistake you can make as a “Guardian of Data” so you should always be prepared for the unexpected…

Several years ago in a Disaster Recovery scenario, I was asked to perform the migration of a large database which was used for a fairly complex ETL process that was both undocumented and something that the DBA team had no insight or involvement.

Due to a complete lack of a Disaster Recovery Plan for the system and process (how many times have we seen this guys!) I was forced to follow best practices, common sense and draw on my experiences in order to bring the service back to a “best known good” as quickly as possible. Since we had access to all existing backups taken so far, and having the ability to take a tail log backup I was able to recover across to DR with no data-loss (at least as far as I was concerned). I handed the environment back to the Business Intelligence team for their validation, operational testing and re-acceptance.

Of course it passed acceptance and resumed operation in production.

Now of course, what was not known at the time I performed this work was that the Database in question was enabled for Change Data Capture (CDC). Perhaps it should have been a logical assumption for me to make, but I suspect that even if I was aware of its presence, the significance of the situation might have escaped me.

I am not really sure what acceptance testing was performed (if indeed it was!), because after a few hours of activity (just enough time for a scheduled ETL process to execute and fail) it became clear that something was very wrong with the database.

Apparently the Change Data Capture tables were missing!

 
So now I knew about the existence of CDC in the database (and after digging around MSDN documentation for a few minutes) our mistake was obvious – we had not used the KEEP_CDC option upon restore, meaning that the CDC tables were no longer present. Unfortunately acceptance testing hadn’t detected this problem ahead of time and now our recovered database was live. After some discussion, the Business Intelligence team decided for us to re-create CDC on this warm database and to temporarily restore a copy of the old one (with KEEP_CDC!) to pull out those missed changes.


So what’s the problem with restoring CDC enabled databases?

The main problem with CDC enabled databases is that the KEEP_CDC option is incompatible with the NORECOVERY option.

For instance the following command:

RESTORE DATABASE [CDCEnabledDB] 
	FROM DISK = 'CDCEnabledDB.bak' 
	WITH NORECOVERY, KEEP_CDC

Will result in:

Msg 3031, Level 16, State 1, Line 1
Option 'norecovery' conflicts with option(s) 'keep_cdc'.
Remove the conflicting option and reissue the statement.

This automatically poses the question of how it is possible to restore a backup chain with CDC? On a database restore, in order to apply differential backups and transaction logs the NORECOVERY clause is required to prevent SQL Server from performing database recovery.

If this option is required but KEEP_CDC in conjunction with it is incompatible, surely this means point in time restores are not possible for restores that require CDC tables to be retained?

-Wrong!


The investigation

When I first ran across this problem (and after reading Connect Item 587277 -which has now been updated by me with the correct resolution) I was incensed that Microsoft could be so stupid as to prevent something so necessary under certain situations. I saw it as a huge gaping hole in database recovery for any database requiring CDC. As I (and others on the Connect Item) saw it, if you cannot restore a log chain and KEEP_CDC then this will cause problems for:

  • Database Mirroring
  • Availability Groups
  • Log Shipping
  • Replication
  • Recovery to Standby

Quite infuriatingly the Connect Item was closed with:

“we looked into this but there is no plan as of now to change the current design and extend the support of this in the foreseeable future. Again, thanks for taking the time to share your feedback, this is really important to us.”

And that was that…for a time.


So what has changed?

Several months ago I was investigating the use of CDC enabled tables for a client project when I stumbled across the following statement in the MSDN documentation Replication, Change Tracking, Change Data Capture, and AlwaysOn Availability Groups :

“SQL Server Replication, change data capture (CDC), and change tracking (CT) are supported on AlwaysOn Availability Groups.”

Remembering the problems I faced years earlier I read that statement with interest and decided to spend time playing with the recovery of Change Data Capture when it dawned on me that mine and the other posters to the Connect Item (including the Microsoft respondent) had completely misunderstood the use of the KEEP_CDC clause.

Our misunderstanding was (unsurprisingly) that the KEEP_CDC clause was necessary to protect Change Data Capture tables from being lost on every restore sequence (which is wrong). Instead the clause is required to protect CDC tables upon RECOVERY. Change Data Capture schema and data is transparently restored behind the scenes during NORECOVERY in anticipation that you use the KEEP_CDC clause on RECOVERY -if not then CDC is dropped (again transparently).

Therefore in a restore sequence, recovery is the ONLY time that you need to specify KEEP_CDC in your restore sequence meaning that the incompatibility between this option and NORECOVERY is irrelevant.

For instance the following will work:

RESTORE DATABASE [CDCEnabledDB] 
	FROM DISK = 'CDCEnabledDB.bak' 
	WITH NORECOVERY

RESTORE LOG [CDCEnabledDB] 
	FROM DISK = 'CDCEnabledDB_1.trn' 
	WITH NORECOVERY

RESTORE LOG [CDCEnabledDB] 
	FROM DISK = 'CDCEnabledDB_2.trn' 
	WITH RECOVERY, KEEP_CDC

In summary

Regardless of however experienced you believe yourself to be with SQL Server, there is always something that can come and bite you. Not even Microsoft Support and technology owners always fully understand how every single scenario or configuration is going to play out, so it is important that you Do Your Own Research.

While the use of CDC does introduce certain considerations that I will talk about another time, you should rest assured that as long as you understand the use of the KEEP_CDC clause (and make sure you use it when necessary) then you will not go far wrong.

Oracle Unbreakable Linux installation fails on Hyper-V Generation 2 Virtual Machine

Have you been attempting (and failing) to install Oracle Unbreakable Linux as a Virtual Machine under Hyper-V and cannot figure out what is wrong?

If you are receiving the following message:

Boot Failed. EFI SCSI Device.
Boot Failed. EFI SCSI Device.
Boot Failed. EFI SCSI Device. Failed Secure Boot Verification.
PXE Network Boot using IPv4
......
PXE-E18: Server response timeout.
Boot Failed. EFI Network.
No Operating System was Loaded.
Press a key to retry the boot sequence...

Secure Boot verificationThen I bet you are using a Hyper-V generation 2 VM?
When I first saw the error message it is quite clear that something is not working properly with the boot sequence and my first thought was that either my install media was corrupt or that there was an incompatibility between the Oracle Linux boot media and Hyper-V generation 2 virtual machines (the error message gives a big clue on this).

Update 11th October 2016

Before continuing with this article, you should know that there is a slightly better way to address this problem, since you are also very likely to run into the same (or similar) problem on CentOS, Ubuntu, or anything else that is not Microsoft Windows in Hyper-V. For instance, I have noticed that on an attempted CentOS installation within a version 8 generation 2 Hyper-V VM (Windows 10 Anniversary Update Edition)  there is not even a visual clue given that there is a Secure Boot failure -at least not on the initial boot (as before).

All we see is:

centos_no_boot
However, if you wait long enough you will eventually arrive at the following screen:

no-uefi

So rather than disable Secure Boot as this blog post instructs, I recommend changing Secure Boot to use the Microsoft UEFI Certificate Authority template rather than the Microsoft Windows template. Make this change through the Virtual Machine settings page (Security node in the hardware pane) and if it is not already obvious, your VM must be stopped in order to change this setting.  Once you have made the change, your problems should be resolved and your Linux distribution should automatically boot.

secure_boot

If you do not see this option available to you, then feel free to proceed with the alternative route as described below.

 
Immediately I tried installing to a generation 1 VM and it ran through smoothly without incident (proving media was fine), so I returned back to my generation 2 VM to resolve the issue. Returning back to the error message, the Failed Secure Boot Verification warning stands out like a sore thumb, and Hyper-V afficionados will recognise that secure boot was actually introduced by generation 2 VMs. Thankfully it is very easy to turn this off and disable secure boot through the System Center/ Firmware option within Virtual Machine Properties pages (or via Hyper-V Manager/ Settings/ Firmware option). Alternatively, we can also do this through PowerShell as follows:

Set-VMFirmware –VMName "VMname" -EnableSecureBoot Off

The next time the machine boots, the installer should automatically launch (and if it isn’t clear already, you must leave secure boot disabled post install).

installing

For more information on this subject see Oracle Linux virtual machines on Hyper-V and visit What’s New in Hyper-V for Windows Server 2012 R2 for and in-depth discussion of Hyper-V new features in Windows 2012 R2.

VIEW, Pugh, Barney McGrew…

Trumpton…Cuthbert, Dibble, Grub.

I never ceases to amaze me that there is always that little bit of SQL Server functionality that you didn’t know about. Some of it good and some of it bad, some of it just meh…

The other day I was stumbling around a new environment, when I decided to use the GUI to script out an existing set of views when I saw something that had me scratching my head. Underneath the actual view definition was the following (some names removed to protect the innocent)…

----------------------------------
-- note view definition removed --
----------------------------------
GO
EXEC sys.sp_addextendedproperty @name=N'MS_DiagramPane1',
@value=N'[0E232FF0-B466-11cf-A24F-00AA00A3EFFF, 1.00]
Begin DesignProperties =
   Begin PaneConfigurations =
      Begin PaneConfiguration = 0
         NumPanes = 4
         Configuration = "(H (1[40] 4[20] 2[20] 3) )"
      End
      Begin PaneConfiguration = 1
         NumPanes = 3
         Configuration = "(H (1 [50] 4 [25] 3))"
      End
      Begin PaneConfiguration = 2
         NumPanes = 3
         Configuration = "(H (1 [50] 2 [25] 3))"
      End
      Begin PaneConfiguration = 3
         NumPanes = 3
         Configuration = "(H (4 [30] 2 [40] 3))"
      End
      Begin PaneConfiguration = 4
         NumPanes = 2
         Configuration = "(H (1 [56] 3))"
      End
      Begin PaneConfiguration = 5
         NumPanes = 2
         Configuration = "(H (2 [66] 3))"
      End
      Begin PaneConfiguration = 6
         NumPanes = 2
         Configuration = "(H (4 [50] 3))"
      End
      Begin PaneConfiguration = 7
         NumPanes = 1
         Configuration = "(V (3))"
      End
      Begin PaneConfiguration = 8
         NumPanes = 3
         Configuration = "(H (1[56] 4[18] 2) )"
      End
      Begin PaneConfiguration = 9
         NumPanes = 2
         Configuration = "(H (1 [75] 4))"
      End
      Begin PaneConfiguration = 10
         NumPanes = 2
         Configuration = "(H (1[66] 2) )"
      End
      Begin PaneConfiguration = 11
         NumPanes = 2
         Configuration = "(H (4 [60] 2))"
      End
      Begin PaneConfiguration = 12
         NumPanes = 1
         Configuration = "(H (1) )"
      End
      Begin PaneConfiguration = 13
         NumPanes = 1
         Configuration = "(V (4))"
      End
      Begin PaneConfiguration = 14
         NumPanes = 1
         Configuration = "(V (2))"
      End
      ActivePaneConfig = 0
   End
   Begin DiagramPane =
      Begin Origin =
         Top = 0
         Left = 0
      End
      Begin Tables =
         Begin Table = "Customer"
            Begin Extent =
               Top = 6
               Left = 38
               Bottom = 114
               Right = 227
            End
            DisplayFlags = 280
            TopColumn = 0
         End
         Begin Table = "Order"
            Begin Extent =
               Top = 6
               Left = 265
               Bottom = 114
               Right = 492
            End
            DisplayFlags = 280
            TopColumn = 0
         End
      End
   End
   Begin SQLPane =
   End
   Begin DataPane =
      Begin ParameterDefaults = ""
      End
   End
   Begin CriteriaPane =
      Begin ColumnWidths = 12
         Column = 1440
         Alias = 900
         Table = 1170
         Output = 720
         Append = 1400
         NewValue = 1170
         SortType = 1350
         SortOrder = 1410
         GroupBy = 1350
         Filter = 1350
         Or = 1350
         Or = 1350
         Or = 1350
      End
   End
End
' , @level0type=N'SCHEMA',@level0name=N'dbo',
@level1type=N'VIEW',@level1name=N'omsPerMonth'
GO

EXEC sys.sp_addextendedproperty @name=N'MS_DiagramPaneCount',
@value=1 , @level0type=N'SCHEMA',@level0name=N'dbo',
@level1type=N'VIEW',@level1name=N'omsPM'
GO

So after a few tests it turns out that when you create SQL Server Views using the SQL Server Management Studio View Designer, SSMS automatically generates and adds these properties to the view behind the scenes to aid the Designer for re-editing purposes. Please bear this in mind when generating scripts for deployment into your staging environments, whilst there doesn’t appear to be any performance drawbacks to this extra meta-data, it is messy and (imho) not best practice to redeploy into production.

10 reasons why I HAte you!

(or My Top Ten High Availability annoyances)

Danger lurks...

Designing, planning, deploying, administrating and recovering (known from now on as DePlanDeAR – and pronounced in the tone of a Caribbean Grandma 🙂 ) a SQL Server High Availability and Disaster Recovery (HADR) solution is really not an easy thing to implement and maintain. There are many reasons for this and whilst (ultimately) our business is only interested in an almost permanent unbroken connectivity to the Database Engine for its Applications or Middleware Clients, the reality is that our databases/s play just a very small part of the whole availability story when maintaining one of these solutions.

The skills and knowledge required for DePlanDeAR for HADR solutions generally span many different teams, require many different Subject Matter Experts (SMEs) and ideally at least one person to communicate and co-ordinate between them with enough insight and direction to achieve and maintain a robust solution. More often than not, I arrive on client sites only to find High Availability strategies that were thought to be fit for purpose when in fact the opposite is usually true.

In this article I am going to list ten of my favorite (if that is the correct term) reasons why I hate YOUR (yes your!) HADR strategy.


1. You have no idea what Quorum Model you are running under.

Quorum is the mechanism used by your Windows Cluster Nodes and determine which can be considered part of the running cluster or which have lost connectivity. In this way Quorum aims to prevent split brain scenarios in which connectivity between nodes is lost. By default each node has one vote which makes up a Quorum maximum and a node will need to see a visible Quorum majority in order to continue running in the Cluster. A loss of a visible Quorum majority by a node would cause it to go offline (important to note in this scenario the Server itself does not shutdown!) and any cluster resources currently owned and running on it would fail over. Therefore as you might guess, Quorum is one of the most important concepts in Windows Clustering and is used by both SQL Server Failover Clustering and AlwaysOn Availability Groups.

Why then, do so few IT professionals from Windows Admins to Database Administrators still fail to make the effort to find out what exactly your Windows Cluster Quorum Model is? Time and time again I have seen the legacy Disk Only Quorum (a throw back from Windows 2003 and earlier and is a single point of failure) configured in Clusters that are not just two nodes but consist of many nodes. There is no excuse, not in any situation to use Disk Only Quorum these days. You should ensure that:

  1. You understand the concept of quorum and the effect it has to your cluster’s availability, regardless whether you are a DBA or Windows administrator.
  2. Ensure it is changed immediately to a more appropriate model!

There have been significant Changes to Cluster Quorum in Windows 2012 and further advancements in Windows 2012R2. These changes will help make Quorum configuration and Cluster availability significantly easier and more efficient. I shall cover these another time but for now you can read a bit more about Quorum in my post “Weight doesn’t ALWAYS have to be AlwaysOn“.

2. Your Windows, Network, Storage and Database teams work in isolation.

There are not many technologies that require the crossover skills in the way that SQL Server High Availability solutions do. Not only do you have to worry about SQL Server functionality itself (and let’s not forget that it does help if you understand some of these HADR subjects quite deeply to avoid their nuances), but you also need to have a good understanding of Windows, Active Directory, Networking and SANs to name a few others. Probably one of the most common scenarios I encounter is teams working in Silos and communication across them being very poor. There is only ever one outcome to silo based designed HADR strategies -they ALWAYS result in bad designs, bad implementations and unconfident and ill-informed support teams.

Having technical cross-over is good. It gives you perspective, an appreciation of another team’s challenges and the ability to communicate in their language. Nobody ever said you couldn’t specialize in your area of choice and be a technical expert did they? You won’t forget your existing skill set just because you have learnt something new. No, it will give you a foundational platform to build your knowledge. The dots start to connect and you smarter!

3. Your Entire Production infrastructure is all LIVE!

So you have implemented a HADR strategy and all seems to work well right? Eventually you will make it live and run with success for a period of time, giving yourself a big pat on the back. Your fantastic design exists purely because you are amazing and no one else could have achieved such a feat of engineering!

Eventually the time comes to patch your Windows or SQL Servers. Then and only then do you realize that your solution requires you to deploy these to the systems that are currently running as LIVE. This mistake is more common than you would believe and your coupling between your “DR” solution and your “HA” solution is so tight that in order to patch anything in your Disaster Recovery site you have to initiate failover in Live! Sometimes it is possible to get around these situations by temporarily workarounds (such as breaking SAN replication and re-establishing later) but most probably your design only accommodates failover.

If it needs saying, ALWAYS try to decouple any strategies you implement as far as possible. There is nothing wrong with using complementing HADR technologies as long as the use of one does not compromise another.

4. You do not run similar HADR infrastructure in your (cross where applicable) UAT/ QAT/ Systest and Dev environments.

In most organisations, Highly Available strategies are only seen as something worthy for production. I have been lucky to work for an organisation that employed nearly 100 developers and yet even there, almost zero thought had been given to the Development environment’s availability. After I spent some time calculating the cumulative cost of man hours that would be lost in the event of any of the development database servers failing, it was a fairly obvious thing to suggest to them that:  in this scenario their Development environment was more important than Production!

What is more, running similar HADR deployments in other environments allows your Developers to design code that is more likely to be suitable for these platforms, allows your Testers to find platform related problems *before* code hits production and empowers you to accurately trial changes before risking doing so in live. I could go on but…

5. Your management think HADR is easy.

Your DBAs understand SQL Server and your Windows Administrators understand Windows? You might almost be as bold to suggest that in each area of specialism, there are some real experts in those teams. Unfortunately HADR implementations span a whole stack of technologies and skillsets ranging from the obvious (Log Shipping) to the not so obvious (and seemingly unconnected) such as SAN replication or Virtualization. Understanding how all of these offerings can be used and knowing how they interoperate and play with each other can be bewildering, even for SMEs. Yes HADR will take you out of your comfort zone, but you will learn a lot from travelling to it and provide more robust solutions and more stable systems.

Remember to explain this necessity to management and make sure you can help them understand the obvious! If you need training, then explain to them why.

6. You think HADR is easy.

You have been using HADR strategies for quite some time now and many of your peers believe you to have almost Jedi like skills. Heck you may have even started to believe your own hype and think you have every base covered.

…If only things were as simple as that.

It is always important to try to eliminate any single point of failure where possible in any HA solution, but there are too many variables to address when considering your designs. It is impossible for you to ever understand the impact of an Operating System (or firmware) patch to various parts of the environment, but ultimately one day your strategy is going to fail. How long the recovery from the system outage is going to take will depend (in part) on how well you actually understood the solution that was implemented. Having administered a working system over a very long period of time with no failures or downtime does not mean you are capable of restoring operations back to normal should they now go belly up.

Do you have confidence that (if you are given point in time sql backups) you would be able to rebuild a system from the ground up in a disaster scenario within the expected service agreements? If not, then you are running at risk.

7. You are not in possession of an SLA, RTO, RPO or system definition.

When failure or disaster strike (and believe me, if they haven’t yet, sooner or later they will), time and again I see people in positions of power, influence or command start flapping and demanding that operations absolutely have to return to a full working service immediately otherwise no Widgets can be sold by ACME Corp, and repercussions will be serious! Yet these highly charge stressed individuals are the very same people that you have approached on numerous occasions to ask for your Sytems Recovery Point Objectives (RPO),  Recovery Time Objectives (RTO) and Service Level Agreements (SLA). With a shrug of the shoulder they calmly tell you that there isn’t yet any defined agreements but they are “working on it”. Or perhaps even worse, they have given you documents which whilst defining what the RPO, RTO and SLA is for a particular system, they fail to DEFINE the system.

I have often seen people responsible for Business Availability go to great lengths to define the agreements for RPOs, RTOs and SLAs for particular Systems, but fail miserably in defining what *exactly* constitutes “The System”. Every Business system is composed of many different moving parts and subsystems, both technical and non technical platforms. All of these things (as we have discussed already) are generally supported by groups of diverse teams that rarely communicate between themselves. At a higher level there are Business processes sitting on top of these platforms that will have their own nuances and quirks and require specialist knowledge.

Therefore is your “System” the entire thing that is being described above OR are you going to break it down into component parts for your availability agreements? Do you even know whether it is possible to meet objectives *if* you were forced to run recovery in a serial nature (which is so often the case in situations like these). You may find there is not enough time…

8. You do not regularly review OR test your HADR strategy.

Your HADR plan is only as good (and no better than) the competency of everybody involved in its design and those who will execute that strategy in the event of failure. Throw into the mix a whole host of ever changing variables, technologies, services and business processes and suddenly you have a moving target to worry about. On too many occasions I have been witness to scenarios in where “The Business” would never allow a regular fail-over policy and believed that any solution currently in place would (if called upon) just work. Your problem is this; the longer it has been since you last tested your HADR plan/s, the more likely your moving targets will have compromised your solution -right? And if you agree with me; it is far better to experience a failed HADR plan when you don’t have to rely on it than when you do.

It only makes sense then, to regularly review your strategies and try to reduce the risk of failures at any time, whether they occur through a managed test or because of an unseen event. I should also widen the scope further and say that if your company has a solid set of change control processes and procedures in place managing (and publicising) changes across your Enterprise, then it is far more likely that your HADR reviews are going to flag potential issues.

Nuff said!

9. You have no documentation (or your documentation is worthless).

By now it really should be self explanatory that if you do not have any documentation for your recovery strategy and these plans only exist in yours or someone elses head then you are destined to run into big trouble on failure. But more commonly documentation will exist, but it is unnecessarily large and difficult to follow. Maybe you wrote it with a baffoon in mind, but honestly, you do not have to describe in gory detail how to do operations that your specialist technical staff should be able to perform. If you are trying to document an operation such as “Restore database AcmeCorpBigDB and all logs including tail backup with norecovery from the most recent taken on Production server AcmeCorpProd1 to AcmeCorpDR1” then that is all you need to say. You do not have to explain which buttons to press or go into detail about how to do it in TSQL versus a GUI based restore (or even use that funky font you have recently discovered to make it look nice), just get straight to the point. Putting sidenotes that might assist in speeding up the process is just about acceptable, but any detail (for dummies) should be referenced through footnotes to other easy to find documents.

Assuming that you have written (in your opinion) the Worlds Greatest Recovery Plan, make sure that someone else gets to appreciate your good work by actually getting them to put it to the test and ultimately give it a quality assured stamp of approval. Choose your most junior member of the team and if they struggle to achieve recovery without asking questions or are doing something wrong, then either your documentation is not fit for purpose OR they need further training. In any event, you should always look toward the documentation as being imperfect before you assume that your Junior DBA needs a brain transplant. Remember that they wowed you in that job interview, so the likelyhood is that you (or your documentation) is at fault.

A final point worth mentioning on the above is that when your most Junior DBA is given the task to perform recovery, make sure that your most Senior DBA is given the task of shadowing them. Make sure that both parties understand that no help will be allowed and that the Senior DBA is simply there to protect the Junior from themselves OR the poor documentation. Remember to emphasize  that the documentation is being tested here NOT the Junior DBA.

10. You have no 24×7 Support for your 24×7 Operations.

How many of you these days work within an oncall rota? That’s great isn’t it? The main problem with oncall is that every team will have ever so slightly different arrangements and understanding about what *exactly* the oncall rota really means when you are oncall. Furthermore, since there is nobody actively watching and monitoring the systems availability, by the time you get to hear about a problem, several hours will have already passed -so much for your Highly Availability Service Level Agreements!

Usually an even bigger threat to recovery of your systems during your oncall hours is the time it normally takes to mobilize all the necessary teams to fix the problem. That is, if you have even managed to identify what *is* causing the problem. Communications across teams seems so much harder and takes so much longer when you should be getting your beauty sleep….

You may now be thinking that I am suggesting off-shoring your night-time support operation? Personally I would only ever suggest doing this if your offshore support have the knowledge and capability to actually fix problems themselves when they happen. If all they are there for is to escalate problems to you when issues are seen, then all you have achieved is added yet another element of perplexity to your support.

Bonus: You are at the mercy your outsourced service provider.

I could tell you stories about my bonus point that would chill you to the bone. But I shall spare you from the horror and simply say that if you are fortunate (or unfortunate) to outsource any of your IT service or infrastructure to a managed service provider, you had better make damn sure they can deliver on any promises that have been made within your SLAs, RPOs and RTOs. If you ever need a new server or new SAN provisioned instantly for whatever reason, can they deliver it within an acceptable time frame? No, of course they bloody can’t!

Have you even bothered to formulate specific SLAs, RPOs and RTOs with them? No I thought not….


book_nowWell thank you for taking the time to read my list and I hope you have enjoyed them. If you fancy immersing yourself further on SQL Server High Availability, then why not check out my Training Day page and book yourself into an up and coming intensive (but fun) day of HADR?