Tag Archives: Availability Groups

10 reasons why I HAte you!

(or My Top Ten High Availability annoyances)

Danger lurks...

Designing, planning, deploying, administering and recovering (known from now on as DePlanDeAR – and pronounced in the tone of a Caribbean Grandma 🙂 ) a SQL Server High Availability and Disaster Recovery (HADR) solution is really not an easy thing to implement and maintain. There are many reasons for this and whilst (ultimately) our business is only interested in almost permanently unbroken connectivity to the Database Engine for its Applications or Middleware Clients, the reality is that our databases play only a very small part in the whole availability story when maintaining one of these solutions.

The skills and knowledge required for DePlanDeAR for HADR solutions generally span many different teams, require many different Subject Matter Experts (SMEs) and ideally at least one person to communicate and co-ordinate between them with enough insight and direction to achieve and maintain a robust solution. More often than not, I arrive on client sites only to find High Availability strategies that were thought to be fit for purpose when in fact the opposite is usually true.

In this article I am going to list ten of my favorite (if that is the correct term) reasons why I hate YOUR (yes your!) HADR strategy.


1. You have no idea what Quorum Model you are running under.

Quorum is the mechanism used by your Windows Cluster Nodes to determine which nodes can be considered part of the running cluster and which have lost connectivity. In this way Quorum aims to prevent split brain scenarios in which connectivity between nodes is lost. By default each node has one vote towards the Quorum maximum, and a node will need to see a visible Quorum majority in order to continue running in the Cluster. A loss of a visible Quorum majority by a node will cause it to go offline (important to note that in this scenario the Server itself does not shut down!) and any cluster resources currently owned and running on it will fail over. Therefore, as you might guess, Quorum is one of the most important concepts in Windows Clustering and is used by both SQL Server Failover Clustering and AlwaysOn Availability Groups.
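To make the voting arithmetic concrete, here is a minimal sketch (in Python, purely as an illustrative model of the rule described above, not anything the cluster itself runs) of the default one-node-one-vote majority calculation:

```python
def has_quorum(surviving_votes: int, total_votes: int) -> bool:
    """A node (or partition of nodes) keeps quorum only while it can
    see a strict majority of the total configured votes."""
    return surviving_votes > total_votes // 2

# Five-node cluster, one vote each: the majority threshold is 3.
assert has_quorum(3, 5) is True   # two nodes lost -> cluster survives
assert has_quorum(2, 5) is False  # three nodes lost -> cluster goes offline
```

This also shows why even node counts are awkward under pure Node Majority: a 4-node cluster split 2/2 leaves neither half with a strict majority.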

Why then do so many IT professionals, from Windows Admins to Database Administrators, still fail to make the effort to find out exactly what their Windows Cluster Quorum Model is? Time and time again I have seen the legacy Disk Only Quorum (a throwback from Windows 2003 and earlier, and a single point of failure) configured in Clusters that are not just two nodes but consist of many nodes. There is no excuse, in any situation, to use Disk Only Quorum these days. You should ensure that:

  1. You understand the concept of quorum and the effect it has on your cluster’s availability, regardless of whether you are a DBA or a Windows administrator.
  2. You change an inappropriate model immediately to a more suitable one!

There have been significant changes to Cluster Quorum in Windows 2012 and further advancements in Windows 2012 R2. These changes make Quorum configuration and Cluster availability significantly easier and more efficient. I shall cover these another time, but for now you can read a bit more about Quorum in my post “Weight doesn’t ALWAYS have to be AlwaysOn“.

2. Your Windows, Network, Storage and Database teams work in isolation.

There are not many technologies that require crossover skills in the way that SQL Server High Availability solutions do. Not only do you have to worry about SQL Server functionality itself (and let’s not forget that it helps to understand some of these HADR subjects quite deeply to avoid their nuances), but you also need a good understanding of Windows, Active Directory, Networking and SANs to name a few. Probably one of the most common scenarios I encounter is teams working in Silos with very poor communication across them. There is only ever one outcome to silo-designed HADR strategies -they ALWAYS result in bad designs, bad implementations and unconfident, ill-informed support teams.

Having technical cross-over is good. It gives you perspective, an appreciation of another team’s challenges and the ability to communicate in their language. Nobody ever said you couldn’t specialize in your area of choice and be a technical expert, did they? You won’t forget your existing skill set just because you have learnt something new. No, it will give you a foundational platform on which to build your knowledge. The dots start to connect and you get smarter!

3. Your Entire Production infrastructure is all LIVE!

So you have implemented a HADR strategy and all seems to work well, right? Eventually you make it live and run with success for a period of time, giving yourself a big pat on the back. Your fantastic design exists purely because you are amazing and no one else could have achieved such a feat of engineering!

Eventually the time comes to patch your Windows or SQL Servers. Then and only then do you realize that your solution requires you to deploy these to the systems that are currently running as LIVE. This mistake is more common than you would believe: the coupling between your “DR” solution and your “HA” solution is so tight that in order to patch anything in your Disaster Recovery site you have to initiate failover in Live! Sometimes it is possible to get around these situations with temporary workarounds (such as breaking SAN replication and re-establishing it later) but most probably your design only accommodates failover.

If it needs saying, ALWAYS try to decouple any strategies you implement as far as possible. There is nothing wrong with using complementary HADR technologies as long as the use of one does not compromise another.

4. You do not run similar HADR infrastructure in your UAT/ QAT/ Systest and Dev environments (where applicable).

In most organisations, Highly Available strategies are only seen as something worthy of production. I have been lucky to work for an organisation that employed nearly 100 developers and yet even there, almost zero thought had been given to the Development environment’s availability. After I spent some time calculating the cumulative cost of man hours that would be lost in the event of any of the development database servers failing, it was fairly obvious to suggest to them that, in this scenario, their Development environment was more important than Production!

What is more, running similar HADR deployments in other environments allows your Developers to design code that is more likely to be suitable for these platforms, allows your Testers to find platform-related problems *before* code hits production, and empowers you to accurately trial changes before risking them in live. I could go on but…

5. Your management think HADR is easy.

Your DBAs understand SQL Server and your Windows Administrators understand Windows? You might almost be so bold as to suggest that in each area of specialism there are some real experts in those teams. Unfortunately HADR implementations span a whole stack of technologies and skillsets, ranging from the obvious (Log Shipping) to the not so obvious (and seemingly unconnected) such as SAN replication or Virtualization. Understanding how all of these offerings can be used and knowing how they interoperate and play with each other can be bewildering, even for SMEs. Yes, HADR will take you out of your comfort zone, but you will learn a lot from travelling to it and will provide more robust solutions and more stable systems.

Remember to explain this necessity to management and make sure you can help them understand the obvious! If you need training, then explain to them why.

6. You think HADR is easy.

You have been using HADR strategies for quite some time now and many of your peers believe you to have almost Jedi like skills. Heck you may have even started to believe your own hype and think you have every base covered.

…If only things were as simple as that.

It is always important to try to eliminate any single point of failure where possible in any HA solution, but there are too many variables to address fully when considering your designs. It is impossible for you to ever understand the impact of every Operating System (or firmware) patch on the various parts of the environment, and ultimately one day your strategy is going to fail. How long the recovery from the system outage is going to take will depend (in part) on how well you actually understood the solution that was implemented. Having administered a working system over a very long period of time with no failures or downtime does not mean you are capable of restoring operations back to normal should they now go belly up.

Do you have confidence that (if you are given point-in-time SQL backups) you would be able to rebuild a system from the ground up in a disaster scenario within the expected service agreements? If not, then you are running at risk.

7. You are not in possession of an SLA, RTO, RPO or system definition.

When failure or disaster strikes (and believe me, if they haven’t yet, sooner or later they will), time and again I see people in positions of power, influence or command start flapping and demanding that operations absolutely have to return to a full working service immediately, otherwise no Widgets can be sold by ACME Corp and the repercussions will be serious! Yet these highly charged, stressed individuals are the very same people that you have approached on numerous occasions to ask for your Systems’ Recovery Point Objectives (RPO), Recovery Time Objectives (RTO) and Service Level Agreements (SLA). With a shrug of the shoulders they calmly tell you that there aren’t yet any defined agreements but they are “working on it”. Or perhaps even worse, they have given you documents which, whilst defining what the RPO, RTO and SLA are for a particular system, fail to DEFINE the system.

I have often seen people responsible for Business Availability go to great lengths to define the agreements for RPOs, RTOs and SLAs for particular Systems, but fail miserably in defining what *exactly* constitutes “The System”. Every Business system is composed of many different moving parts and subsystems, both technical and non-technical. All of these things (as we have discussed already) are generally supported by groups of diverse teams that rarely communicate between themselves. At a higher level there are Business processes sitting on top of these platforms that have their own nuances and quirks and require specialist knowledge.

Therefore, is your “System” the entire thing described above, OR are you going to break it down into component parts for your availability agreements? Do you even know whether it is possible to meet objectives *if* you were forced to run recovery in a serial fashion (which is so often the case in situations like these)? You may find there is not enough time…
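As a back-of-the-envelope illustration of why serial recovery blows an RTO (every figure below is invented for the example), the arithmetic is as simple as it is sobering:

```python
# Hypothetical recovery steps (in hours) that must run one after another
# because the teams and dependencies force serial execution.
recovery_steps = {
    "provision replacement server": 4.0,
    "restore SAN LUNs": 3.0,
    "restore SQL databases": 2.5,
    "middleware and app tier": 2.0,
    "business verification": 1.5,
}

rto_hours = 8.0  # the (hypothetical) agreed Recovery Time Objective
total = sum(recovery_steps.values())

print(f"Serial recovery needs {total} h against an RTO of {rto_hours} h")
assert total > rto_hours  # the agreed RTO cannot be met serially
```

Each step may individually look comfortably inside the objective; it is only when you sum them that the shortfall appears.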

8. You do not regularly review OR test your HADR strategy.

Your HADR plan is only as good as (and no better than) the competency of everybody involved in its design and those who will execute that strategy in the event of failure. Throw into the mix a whole host of ever changing variables, technologies, services and business processes and suddenly you have a moving target to worry about. On too many occasions I have witnessed scenarios where “The Business” would never allow a regular fail-over policy and believed that any solution currently in place would (if called upon) just work. Your problem is this: the longer it has been since you last tested your HADR plan/s, the more likely your moving targets will have compromised your solution -right? And if you agree with me, it is far better to experience a failed HADR plan when you don’t have to rely on it than when you do.

It only makes sense then, to regularly review your strategies and try to reduce the risk of failures at any time, whether they occur through a managed test or because of an unseen event. I should also widen the scope further and say that if your company has a solid set of change control processes and procedures in place managing (and publicising) changes across your Enterprise, then it is far more likely that your HADR reviews are going to flag potential issues.

Nuff said!

9. You have no documentation (or your documentation is worthless).

By now it really should be self-explanatory that if you do not have any documentation for your recovery strategy and these plans only exist in your or someone else’s head, then you are destined to run into big trouble on failure. More commonly documentation will exist, but it is unnecessarily large and difficult to follow. Maybe you wrote it with a buffoon in mind, but honestly, you do not have to describe in gory detail how to do operations that your specialist technical staff should be able to perform. If you are trying to document an operation such as “Restore database AcmeCorpBigDB and all logs including tail backup with norecovery from the most recent taken on Production server AcmeCorpProd1 to AcmeCorpDR1” then that is all you need to say. You do not have to explain which buttons to press or go into detail about how to do it in TSQL versus a GUI based restore (or even use that funky font you have recently discovered to make it look nice), just get straight to the point. Sidenotes that might assist in speeding up the process are just about acceptable, but any detail (for dummies) should be referenced through footnotes to other easy-to-find documents.

Assuming that you have written (in your opinion) the World’s Greatest Recovery Plan, make sure that someone else gets to appreciate your good work by actually getting them to put it to the test and ultimately give it a quality assured stamp of approval. Choose your most junior member of the team, and if they struggle to achieve recovery without asking questions or are doing something wrong, then either your documentation is not fit for purpose OR they need further training. In any event, you should always look toward the documentation as being imperfect before you assume that your Junior DBA needs a brain transplant. Remember that they wowed you in that job interview, so the likelihood is that you (or your documentation) are at fault.

A final point worth mentioning on the above is that when your most Junior DBA is given the task of performing recovery, make sure that your most Senior DBA is given the task of shadowing them. Make sure that both parties understand that no help will be allowed and that the Senior DBA is simply there to protect the Junior from themselves OR the poor documentation. Remember to emphasize that the documentation is being tested here, NOT the Junior DBA.

10. You have no 24×7 Support for your 24×7 Operations.

How many of you these days work within an oncall rota? That’s great, isn’t it? The main problem with oncall is that every team will have ever so slightly different arrangements and understanding about what *exactly* the oncall rota really means when you are oncall. Furthermore, since there is nobody actively watching and monitoring the systems’ availability, by the time you get to hear about a problem several hours will have already passed -so much for your High Availability Service Level Agreements!

Usually an even bigger threat to the recovery of your systems during your oncall hours is the time it normally takes to mobilize all the necessary teams to fix the problem. That is, if you have even managed to identify what *is* causing the problem. Communication across teams seems so much harder and takes so much longer when you should be getting your beauty sleep….

You may now be thinking that I am suggesting off-shoring your night-time support operation? Personally I would only ever suggest doing this if your offshore support have the knowledge and capability to actually fix problems themselves when they happen. If all they are there for is to escalate problems to you when issues are seen, then all you have achieved is to add yet another layer of complexity to your support.

Bonus: You are at the mercy of your outsourced service provider.

I could tell you stories about my bonus point that would chill you to the bone. But I shall spare you from the horror and simply say that if you are fortunate (or unfortunate) enough to outsource any of your IT service or infrastructure to a managed service provider, you had better make damn sure they can deliver on any promises that have been made within your SLAs, RPOs and RTOs. If you ever need a new server or new SAN provisioned instantly for whatever reason, can they deliver it within an acceptable time frame? No, of course they bloody can’t!

Have you even bothered to formulate specific SLAs, RPOs and RTOs with them? No I thought not….


Well, thank you for taking the time to read my list and I hope you have enjoyed it. If you fancy immersing yourself further in SQL Server High Availability, then why not check out my Training Day page and book yourself onto an upcoming intensive (but fun) day of HADR?

Remember your OS when upgrading to SQL 2012

Probably as far back as the end of last year, when I first started putting material together for my Enter the Dragon and Moves like Jagger presentations (and giving very early mash-up presentations of the two combined), I started emphasising one very important upgrade consideration which is very often ignored. It is also a fact that is quite nicely touched upon by Glenn Berry (twitter|blog) in his excellent book SQL Server Hardware, and if you have yet to discover its unique delights then I strongly advise you to check it out.

The consideration I am talking about is of course the mainstream support dates of our Hardware and Software, but most specifically of the Operating System (in our case Windows), which is the one of most concern.

Spot the problem?

During my Moves Like Jagger – Upgrading to SQL Server 2012 presentation (most recently given to 24HOP) I have discussed that, should you decide to upgrade to SQL Server 2012, you should watch out for the mainstream support end dates for Windows 2008 and Windows 2008 R2. If you are installing SQL 2012 onto Windows Server Core then your only option is Windows 2008 R2 (SP1); otherwise Windows 2008 (SP2) is also a supported Operating System. However, from a mainstream support perspective the version of Windows you use is somewhat irrelevant since (as you can see in the slide above) each expires on exactly the same date -surprisingly, mid July next year!

Yes people, you heard me right: should you choose to upgrade your environment to SQL 2012 today or tomorrow, or have already done so, then I hope you have taken this into consideration -for if your company is not prepared to pay for Extended Support then you could be looking at yet another upgrade (this time the OS) sometime next year. This scenario is one very strong reason why, if you have implemented SQL 2012 in an Availability Group or Clustered configuration, you can probably give yourself a pat on the back right now. It is going to simplify your OS upgrade and maintain a level of High Availability where this is required.

Beautiful now, but rebuild costs could be large!

At the turn of the year I went slightly out on a limb by saying that because of the Mainstream Support expiry dates it was my firm belief that Microsoft would release their newest Operating System (I called it Windows 8 Server at the time :)) before mid July and that if you did not have a compelling reason to upgrade to SQL 2012 now, then it is probably a good idea to wait a while.

Well, so far my guesstimate appears to be coming true. We now know that the next Server release of Windows is called Windows 2012 (I personally like the fact it is in line with the Server based Application names) and today I have seen the following URL -Windows Server 2012 Release Candidate Timing. So we know by the very name that the OS is coming this year, and at the time of writing all indications point to an imminent release, but let me stress that I have no inside knowledge and this is all guesswork, so please don’t get upset if the release happens on the 31st of December 2012 -however unlikely!

In summary, I believe there are three approaches you can take to implementing SQL Server 2012 at the present time :-

  1. The first and probably my most recommended option would be to wait for (hopefully only) a few more months if you can, until the release of Windows 2012.

The second and third options are to plan for the Mainstream Support end of Windows 2008/R2 by :-

  2. Arranging and paying for Extended Support.
  3. Implementing SQL 2012 in an HA configuration that enables you to blow away your base Operating System and start again (for instance, AlwaysOn or even Peer to Peer Replication are possible routes you can take).

Whatever you do, don’t forget the obvious …and remember your dates.

Weight doesn’t ALWAYS have to be AlwaysOn

One thing I keep hearing myself mentioning more and more in conversation (and most recently in a discussion group at SQLBits a few days ago) is the ability to configure your Windows Cluster Quorum for situations where each cluster node may not be of equal importance.

In SQL Server 2012 we have various high availability enhancements and improvements, and most of these are branded under the term AlwaysOn. In particular we have enhancements to AlwaysOn SQL Failover Clustering and a new technology known as AlwaysOn Availability Groups. Whilst I won’t bore you with any specific details about what these are or how to use them -since that is no doubt something for another day, you are probably already aware that both require the existence of a Windows Cluster in order to use them.

One of the biggest reasons why Windows Clustering has been adopted as a pre-requisite for AlwaysOn Availability Groups is to use the Quorum, a voting mechanism that determines which nodes can collectively be considered still “alive” and which nodes can singularly or collectively be considered to have “failed”. This mechanism is a very important concept to AlwaysOn since it prevents the classic split brain problem. With respect to Availability Groups it also means that the Witness server used for Database Mirroring is not needed to implement this technology and, because of this, it is more scalable and reliable than it otherwise would be. *1

Quorum is not so CUTE!

As you may also be aware, in Windows 2008/R2 there are four types of Quorum model that you may use and these are:-

  • Node Majority -where more than half the nodes must remain active.
  • Node and Disk Majority -where the disk acts as an extra vote and that more than half the votes must be available.
  • Node and File Share Majority -where the file share acts as an extra vote and that more than half the votes must be available.
  • No Majority: Disk Only -the disk quorum must remain available.

In all cases, should the required conditions not hold, each Windows Cluster Node’s Cluster Service will automatically shut down since “Quorum” is lost.

In situations where the total number of Cluster Nodes is an odd number, you would traditionally use the Node Majority Quorum model, meaning you could lose just under half of your cluster nodes before having a cluster failure. Otherwise the other three Quorum models should be considered (generally Disk Only Quorum could be a valid option when the node count is one or two but should otherwise only be considered in special cases).

What is not commonly known is that a Cluster Node does not HAVE to have a quorum vote. By default they do, but it is possible to set what is known as Node Weight to zero. Before we come onto that though, you are probably wondering exactly why you would want to do this? Well there are several scenarios that make this a desirable thing to do such as in situations where you have implemented a Geo-Cluster (a Cluster across Geographic locations).

Consider the following diagram :-

AlwaysOn FCI or AG nodes across sites

As you can see, we have Site A and Site B, each hosting a selection of Cluster Nodes. It is possible that each node might be hosting a standalone instance of SQL Server and is simply part of a Windows Cluster because we are using Availability Groups (so each can house a Replica), OR it might be that we have implemented SQL Failover Clustered Instances, perhaps because we require the ability to fail SQL instances across sites.

Now in the scenario we propose (and assuming we choose a Quorum model of Node Majority) the total number of Quorum votes equals five. Imagine then that we suddenly have a failure of Node C. Since the number of votes would then equal four we would still have “quorum”. Now consider the prospect that we lose connectivity to Site B. This event would not only cause the Cluster Service for nodes on Site B to shut down (since their total votes of two would be less than the required majority of three) but would also cause the Cluster Service for nodes on Site A to shut down since there would also only be two votes available.

Although the Cluster Nodes on Site A could be forced to start, we have unfortunately lost availability (however temporarily) and, more importantly, this requires manual intervention. Perhaps a slightly more elegant solution would be to set the Node Weight of Nodes D and E to zero, meaning that nodes A, B and C are the only ones that can cast a vote to make quorum (making a total Quorum count of three). In the event of a loss of one node on Site A and a loss of connectivity to Site B, two voting nodes will still be casting a quorum majority, thereby keeping the Cluster available.
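Purely as an illustrative model of the scenario above (this is Python modelling the vote arithmetic, not a cluster API), with Nodes D and E on Site B given zero weight:

```python
# Node weights as in the scenario: Site A nodes A, B, C vote;
# Site B nodes D, E carry no vote.
node_weight = {"A": 1, "B": 1, "C": 1, "D": 0, "E": 0}
total_votes = sum(node_weight.values())  # 3 votes in total

def partition_has_quorum(visible_nodes):
    """A partition survives only if the weights of the nodes it can
    see form a strict majority of the total configured votes."""
    votes = sum(node_weight[n] for n in visible_nodes)
    return votes > total_votes // 2

# Node C fails AND connectivity to Site B is lost:
assert partition_has_quorum({"A", "B"}) is True   # 2 of 3 votes -> stays up
# Site B on its own can never form quorum:
assert partition_has_quorum({"D", "E"}) is False  # 0 votes
```

Compare this with the five-vote configuration, where the same double failure would have taken the whole cluster offline.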

So let’s now move on to how you are able to set the Node Weight for your Cluster nodes. Unsurprisingly we can do this through a property called “NodeWeight”, but this is not accessible by default. This can be demonstrated by using the following PowerShell script (first we must import the failover clustering module):-

Import-Module FailoverClusters
Get-ClusterNode | Format-Table -AutoSize -Property Name, State, NodeWeight

We get the following result:-

Name         State NodeWeight
----         ----- ----------
wonko           Up
wowbagger       Up

However, installing Hotfix 2494036 (which, I should add, requires a reboot to take effect) makes the NodeWeight property accessible:-

Name         State NodeWeight
----         ----- ----------
wonko           Up          1
wowbagger       Up          0

As you can see from the above, I have already set the NodeWeight of wowbagger to zero and I did so by running the following PowerShell command:-

(Get-ClusterNode "wowbagger").NodeWeight = 0

Before you get all Jackie Chan on me and set some of your Cluster Nodes’ weights to zero, you should first sit down and seriously draw up a design strategy to decide whether this makes sense to do. In the scenario I proposed, “the Business” had stipulated that a loss of Site B or any of the Cluster Nodes within it should not in any way affect the availability of the primary Site A, but by configuring quorum this way we reduce the total number of possible failures on Site A to a maximum of one in a Cluster containing five nodes! Therefore be very careful and cautious, and remember that your Cluster Node weight doesn’t always have to be AlwaysOn.

*1 This is a contentious issue for some (including me) since, unlike with Database Mirroring, AlwaysOn Availability Groups (since it uses Windows Clustering) requires that a single Active Directory Domain spans each Geographic location.

Let’s keep it short – problem creating availability group listeners

Someone once said to me that he thought I should work as a GUI tester, since I always seemed to find problems in the UI -this after I highlighted a problem I’d found in Linux Gnome that he had never seen before in all his years working with the product. I guess there is an element of truth to that statement since I really do like clicking around and trying different things. However I suspect that my latest discovery won’t necessarily be one of those obscure problems that no-one else comes across, since it is such an obvious problem and fairly easy to run into.

Whilst playing around with SQL Server 2012 (Denali) Availability Groups, I discovered that there is a restriction set on the size of the listener name that you are able to enter through the GUI.

Let me demonstrate…

First I will create a new DNS host name called “readpastandfurious” which you should note is 18 characters long.

create our “longish” DNS host record

And of course there isn’t a problem creating this record, since the restriction on the length of a DNS host name label is 63 characters.

DNS Server will now resolve our host record

Next I decide to create an Availability Group Listener via the SSMS GUI wizard and am very surprised when I can only type in 15 characters! Obviously this text field must have been programmatically restricted since by default the GUI text fields would not be limited.

oh dear we are out of space

My first thought was that it has been restricted on purpose perhaps for an odd NETBIOS reason -if for example NETBIOS was being used for resolution in any way in addition to DNS (daft but possible). Microsoft limits NETBIOS names to 15 characters, so this might have been a plausible explanation.

Next I decide to try to create the listener with the full 18-character name via T-SQL, and I am a little surprised to find that it succeeds with no problems at all!

ALTER AVAILABILITY GROUP [READPAST & Furious Optimistic]
ADD LISTENER 'readpastandfurious' (WITH IP ( ('10.0.5.1','255.0.0.0') ) )

but as I suspected, TSQL is hunky dory

My conclusion then is that this behaviour is almost certainly a bug, introduced by a programmer who either doesn’t understand the difference between NETBIOS and DNS or blindly assumed that 15 characters is enough?! In any event I believe this is a problem and have therefore raised a connect item for this issue here -give it a vote up if you believe you might fall prey to this issue at some point in the future!

T-SQL Tuesday #19 – Your mistake… my problem


This month’s T-SQL Tuesday is brought to you by Allen Kinsel (blog|twitter) who contributes a great amount of his time to the SQL Community and so we can only say a huge thank you for your efforts and keep up the good work! I’ve not been able to participate in T-SQL Tuesday for a while so it is good to catch this one which is made easier by keeping a close eye on what Rob Farley (blog|twitter) is up to because he is always on time every time! Here’s my entry…

Many years ago I started a position at a company that did not understand the value of testing whether a SQL backup was valid. Within my very first few days a problem was discovered in a very important database holding extremely important information, and on inspection it was found to contain serious corruption throughout.

An existing DBA had looked through the archives of tape backups for this database going back around three months and there was not a valid backup to be seen (realised after numerous test restores). This was clearly not good, and by now the business had been informed and, even more importantly, were aware that the problem had (for some reason) been handed on to me! I guess I should have been flattered by this gesture of faith in my abilities, but it felt very much like passing the buck to the new guy. And the buck was firmly in my lap.

Fighting back the Business

In scenarios of corruption on a database I always try not to work on the source material (or even server) if space and resources allow, so the very first thing I did was to close down and disable all activity to this problematic database and take a reference backup. Once a reference copy had been taken I could use it to restore several databases on a different server, so that I could have a purely faulty copy and a fixed copy. On the first copy I ran a CHECKDB (without fix) and used this to identify *where* exactly the problems lay and *what* exactly I was going to do to repair them. The situation was bad I’m afraid, but at least localised to one particular table, though seemingly throughout it -from top to bottom. Whenever I tried to run a SELECT statement to query this table it would fail very quickly.

The next thing I tried was to skip over the problem rows or pages by restricting the query with a WHERE clause on the primary key, which allowed it to continue returning rows up to the next point of failure, and so on. I then had an idea: I could reconstruct the table by querying only the data that was good and worry about the data that was bad later. So the next step was to query the primary key column alone in a SELECT, and I was pleased to see that the entire column was good. Querying each of the other columns in turn showed that the corruption was not specific to any one column.
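The probing looked something like this (table and column names are hypothetical):

```sql
-- Resume the scan from just past the last key that returned successfully,
-- hopping over the corrupt region a range at a time:
SELECT Id, Col1, Col2
FROM dbo.BigTable
WHERE Id > 123456        -- last good key before the previous failure
ORDER BY Id;

-- The clustered key column itself read cleanly end to end, giving a
-- complete "skeleton" of every row that should exist:
SELECT Id
FROM dbo.BigTable
ORDER BY Id;

-- Each remaining column was then probed individually:
SELECT Id, Col1 FROM dbo.BigTable ORDER BY Id;
SELECT Id, Col2 FROM dbo.BigTable ORDER BY Id;
```

A clean key column is the crucial finding here: it means every row identity is recoverable, and the salvage problem reduces to filling in values per column.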

My plan therefore was to build up and re-materialize the table by querying it in a checkerboard style using a multi-cursor based solution, so that if corruption was found on a row then the column that caused the error would be skipped over in the next iteration -and this worked like an absolute charm. What it did mean, of course, was that the resulting table had holes in specific columns of specific rows, but from a table of millions of records I now had problems on only several hundred rows. The next step was to repair the second copy of the database, and for all the data pages that were repaired successfully without data loss I could use that data to fill holes in my new table where possible.
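A much-simplified sketch of the checkerboard idea follows (hypothetical names; one important caveat is that a corruption read error such as msg 824 is severity 24 and can sever the connection entirely, so in practice the per-cell probes may need to be driven batch-by-batch from a client rather than caught in-server):

```sql
-- Seed the rebuild table with the clean key skeleton.
INSERT INTO dbo.BigTable_Rebuilt (Id)
SELECT Id FROM dbo.BigTable;

-- Then, for each key and each column in turn, attempt to copy the single
-- value across; a failed read leaves a NULL "hole" to be filled later.
BEGIN TRY
    UPDATE r
    SET    r.Col1 = s.Col1
    FROM   dbo.BigTable_Rebuilt AS r
    JOIN   dbo.BigTable AS s ON s.Id = r.Id
    WHERE  r.Id = @Id;        -- @Id supplied by the driving cursor/loop
END TRY
BEGIN CATCH
    -- Corrupt page under Col1 for this row: leave the hole, move on.
END CATCH;
-- ...repeat per remaining column, then per key...

-- On the second copy, accept the data loss so that its surviving repaired
-- pages can be used to fill holes in the rebuilt table:
ALTER DATABASE [ImportantDB_Fixed] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (N'ImportantDB_Fixed', REPAIR_ALLOW_DATA_LOSS);
ALTER DATABASE [ImportantDB_Fixed] SET MULTI_USER;
```

REPAIR_ALLOW_DATA_LOSS is safe to use so freely here only because it is running against a disposable copy -the faulty copy and the reference backup remain untouched.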

The problem rows had now reduced to around 14 or so, and it was decided to refer to a heavily updated copy of this production database that had been refreshed to the UAT environment several weeks previously, and determine whether or not these problem rows had in fact been altered since. Our investigations gave us confidence that they hadn’t, and so these final pieces of the jigsaw were used to fill the final gaps. The business signed off the solution after checking through the data, and we put the repaired database, with our new table, back into production. A success for me and the business, and a lesson learnt for the team about the importance of ensuring backups ARE validated.
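Filling the holes from the other copies is then a straightforward set-based update (again, all names hypothetical):

```sql
-- Patch NULL holes in the rebuilt table from the repaired copy's surviving
-- pages; the same pattern was then used against the verified UAT copy for
-- the final handful of rows.
UPDATE r
SET    r.Col1 = COALESCE(r.Col1, f.Col1),
       r.Col2 = COALESCE(r.Col2, f.Col2)
FROM   dbo.BigTable_Rebuilt AS r
JOIN   ImportantDB_Fixed.dbo.BigTable AS f
       ON f.Id = r.Id
WHERE  r.Col1 IS NULL
   OR  r.Col2 IS NULL;
```

COALESCE ensures a value already salvaged from the faulty copy is never overwritten -each source only ever fills gaps left by the previous one.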

Sometimes you’ll find yourself in situations that have arisen through no fault of your own. Remember that you can’t always choose your disaster scenarios or even do anything to prevent them. A disaster is a disaster no matter who caused it or who could have prevented it. When disaster strikes it is important that you are prepared to spring into action, remain calm and focused and, most of all, try to be inventive in situations that could benefit from that approach.

Good luck!

SQLBits 8 – finis!

Nearly two weeks have passed since another SQLBits, and what another fabulous event it was. After an incredibly early morning train journey down to Brighton for the excellent Thursday training session with Thomas Kejser, and an enjoyable time presenting my “SQL Server Clustering for Dummies” session, I am well and truly back down to earth with a thump.

Personal highs for me were :-

  1. Finally getting to meet Allan Hirt and not only that but I had a Geo-Cluster debate with him in the speaker room 🙂
  2. Meeting a few of the new first time SQLBits speakers -in particular Atul Thakor (blog|twitter), who I had a very long chat with in the speaker room and on the morning before his presentation. He had run into lots of problems with his laptop and demos prior to his session -which can strike the best of us, but managed to keep his wits about him and plough on to success when he finally came to present. What a top guy, and I look forward to meeting him again at a User Group or another conference.
  3. Recognizing some audience members in my Clustering for Dummies session from my SQLBits 7 session. I ran into a couple of them later on and said a big thanks for the support, gave them my personal email and very much hope they do drop me a mail sometime to say hello.
  4. Bumping into some old faces…people I’ve met in Seattle, previous UK conferences and User Groups, and old colleagues. A special big-up to Judith Kinsella, who made the effort to come to my presentation (which threw me when I saw her in the audience), although she keeps telling me that I remind her of John Peel and that she just can’t take me seriously 🙂 🙂
  5. Being able to enjoy the experience a little more, due to familiarity of proceedings. Anyone who presents at a major conference for the first time will tell you that it is all a bit of a daze, stressful and is a big unknown what you have let yourself in for. This time I knew what to expect, what I was doing and what I needed to do.
  6. Seeing some great presentations.
  7. Excellent weather!

In full flow!

With all the good things going on there were a few disappointments, such as :-

  1. Not being able to spend a drunken Friday evening celebrating due to my early morning (8.10!) session. Actually this was probably for the best, and although I am a bit grumpy in the morning I felt perfectly relaxed (probably too tired to get nervous).
  2. Spending hardly any time with my SQL friends! I did at least manage to catch up with them from time to time in the speaker room, so it was nice to have a couple of chats.
  3. Not going to as many sessions as I intended. I don’t know if every speaker is the same, but due to the huge amount of time and preparation (and worrying) that goes into a brief 60 minute period I find it quite odd when it is all over. Sometimes all I want to do is let my head clear for a little while, and that is exactly what I did. I dropped my stuff off and met up with a friend for some excellent Eggs Benedict and strong black coffee in lovely sunshine outside a café in the posh end of Brighton.
  4. Getting a little too plastered on the Thursday night. When you wake up with a massive hangover you have probably gone too far the night before …and yes people, I probably had. The only success from all this was the fact that I’d felt myself becoming rather ill and stopped drinking alcohol, otherwise I hate to think how I would have felt the next day.

Always very flattering to be asked more questions…

To summarise though, another fabulous event, and I can’t wait to see where the next one will be. My personal preference would be Edinburgh, although I would have liked that one to be during the Summer months. So perhaps the favourite right now would be Bristol, which should be a little warmer than the rest of the country around that time.

Roll-on SQLBits 9!