Category Archives: HADR

Generate your cluster logs

Windows and SQL Clustering can be a tricky business, but fortunately most of the administration and troubleshooting can be performed through the GUI. Of course, the simplicity of any GUI can also be its downfall, in the sense that it opens up something complicated to the inexperienced. That can be a recipe for disaster on Clusters, but as mentioned in previous posts it is really down to your IT structures to restrict access to them.

Beginner to moderate level Clustering administrators may not be aware of this, but there is much you can (and sometimes must) perform from the command line. In day-to-day running there is usually no need to do so, but knowing the various possibilities available to you from the command line can be an absolute life saver. I also want to add that there have been numerous posts recently on various professionals' blogs about managing Clusters from PowerShell, but that is not what I am discussing here. The commands I refer to are the native executables that are installed through the setup of Clustering.

I recently came across a great post about one such command by Joe Sack, having been pointed in its direction by a tweet from Gavin Payne (blog | twitter). The post immediately caught my attention (seeing the word Clustering always does), and I encourage you to check it out. Joe discusses generating the Cluster log files for Windows 2008 Clusters, so I decided to quickly try it out myself. Trying is the only sure-fire way of remembering, in my humble opinion!

Running the command will generate the logs for every clustered node in your specified location. Should any node be offline, the command will simply return an RPC error for that node.

So the first thing you need to do is run the command to generate those logs.
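For reference, the kind of thing I ran is below. This is only a sketch from memory: the destination folder is just an example of mine, and your cluster.exe options may vary slightly between builds, so check the built-in help first.

REM generate cluster.log on every node and copy them all to a central folder
cluster log /g /copy:"C:\ClusterLogs"

REM or target a specific cluster by name from another machine
cluster /cluster:MYCLUSTER log /g /copy:"C:\ClusterLogs"

If in doubt, cluster log /? should show you what your version supports.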

Once all those lovely logs have been generated (it appears to be one per node) you can review them for your enjoyment.

As you can see, when a problem hits, just knowing how to do this could help you solve the issue much faster than otherwise. Hopefully in future posts I can look into what else you might want to use the native command line for when configuring your Clusters.

Cluster ain't misbehavin'

One of the great things about SQL Clustered Instances is that they generally give a much greater possibility of High Availability, which is obviously the whole point of them. However, one of the more commonly overlooked considerations is that by their very design you are introducing extra complexity into your environment, and this can be counter-productive and work against what you are trying to achieve in the first place. For instance, if you allow unqualified and inexperienced people to access and even administer your SQL Clusters then you could be in for a world of pain. I have seen it time and time again, and yet those classic mistakes seem to keep being allowed to happen. To highlight the point further, imagine your sysadmins, operators and DBAs going about their everyday management of systems oblivious to the fact that what they are working on is actually a clustered node. They could be creating a network share, stopping a service or even installing software without realizing what they are doing and the risk they are taking. It is for this very reason that I believe Windows Clusters in particular require a specialized team to support them, because we want to maintain that High Availability, don't we folks?

Another area of consideration is problem resolution. Your management team may think that just because you now have a shiny new Cluster you will never have any more problems or outages on these systems, but this is obviously wrong and can add to the stress levels and expectations when trying to get a system back online. You should also consider that resolving an issue can get a whole lot more difficult, since your resolution must take the Cluster into account: firstly, making a change on a clustered node can have a knock-on effect on the whole of your Cluster configuration, and secondly, what you would have changed in a standalone environment may not even be possible in a Clustered Environment. So remember you are on a harder learning curve and NEED to remember this when disaster strikes, because it WILL take you longer to fix and the worst thing you can do is panic.

A case in point happened to me the other day with a SQL Cluster that has effectively been open to every man and his dog (don't ask -it was a political decision from long ago). I was experiencing very unusual behavior from the SQL instance: the Clustered Instance actually appeared to be coming online, but when you attempted to connect to it via SSMS and expand the Databases node you were presented with the hourglass forever. The next step was to connect via SQLCMD, and various queries could be run; even a list of the databases went through OK, which indicated that various user databases were "Recovering..", but if I tried to filter or sort that list then the prompt would just hang. I couldn't even run sp_who2! The error logs did not really highlight anything obvious.
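For anyone wanting to run similar first-pass checks, this is roughly the sort of thing I mean (the virtual server and instance names below are made up for illustration):

sqlcmd -S SQLCLUSTER01\INST1 -E

-- list the databases and their current states
SELECT name, state_desc FROM sys.databases;
GO
-- and see what the sessions are up to
EXEC sp_who2;
GO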

My biggest problem with all this was that I really wanted to know whether it was a Clustering problem or an Instance problem, and it was hard to decide since I didn't have much to go on. I decided it must be the latter, based on the fact that the behavior on display appeared to be DB engine related and would carry over to the next node on failover. Usually when the problem is Cluster based, the resource in question will not start. I personally believed that something was wrong with my system databases, but I had no warnings and they were all coming online with no problems. Trying to run a CHECKDB against the master database just appeared to do nothing (exactly the same behavior as some of the other commands), so this didn't really tell me anything; I let it run for a while nevertheless.

Once convinced that this was a waste of time, I decided that enough was enough and it was time for some "guerrilla tactics". First I checked the integrity of the disk that housed the system databases and was surprised to find there was a problem. DBAs please take note... I have very rarely seen another DBA check file system integrity (and many sysadmins too, come to think of it), but it is an incredibly useful thing to check yourself, and fixing any corruption you find very often clears up odd behavior in various apps. A common example in SQL Server is backups to disk failing for no apparent reason. So, once I had detected that there was a problem in that area, I took the Cluster Group's SQL services offline and also checked the remaining disks, but they were OK. Then, once all file handles to the drive with issues were closed, I ran a fix.
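Roughly speaking, the steps looked like the following. The resource names and drive letter are made up here, so substitute your own, and obviously make sure nothing still has handles open to the drive before the fix.

REM read-only check of the drive housing the system databases
chkdsk F:

REM take the SQL resources offline so no file handles remain open
cluster resource "SQL Server Agent (INST1)" /offline
cluster resource "SQL Server (INST1)" /offline

REM now run the fix against the suspect drive
chkdsk F: /F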

(screenshot: filesystem corruption)

I fully expected that to solve the problem, having seen file system corruption be the cause of many an issue in the past, but after bringing the Cluster Group's SQL services back online the same strange behavior remained! No more Mr. Nice Guy, I thought, and decided that something really must be wrong with the master database. A restore of master was needed, but the issue was how to do this on a Clustered Instance. For example, there was no obvious place to enter any startup parameters on the SQL Server resource to start it in minimal configuration mode -although I could have sworn I'd done this before on previous versions of Clustering (this one was Windows 2008). Attempting to change the service parameters directly did not work either; the changes were not remembered. So what to do now?

I decided that it would be easier to start the instance through the command line, but wondered whether this would work since we have to specify the SQL instance name (ours was the SQL Failover Cluster name) and in reality we were starting the local SQL Server service on that node. To my surprise, using the cluster name and instance worked perfectly: the instance came online in minimal configuration mode (and therefore also in single-user mode). I made sure I quickly connected to grab that one and only connection, and I was now ready for the restore.
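For the record, this is roughly what I mean. The instance name, virtual server name and path are placeholders, so adjust them to your own install:

REM from the Binn directory of the instance, on the node that owns the disks
cd /d "C:\Program Files\Microsoft SQL Server\MSSQL10.INST1\MSSQL\Binn"

REM -c = not as a service, -m = single user, -f = minimal configuration, -s = instance name
sqlservr.exe -c -m -f -s INST1

REM ...then, from a second prompt, grab that one and only connection
sqlcmd -S SQLCLUSTER01\INST1 -E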

The restore was probably the easiest part of the whole shenanigans; I simply typed out the command to do it, and once it completed the instance automatically shut itself down.
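In case it helps anyone, the restore itself is just the standard form below, run from that single connection (the backup path is obviously an example):

RESTORE DATABASE master
FROM DISK = N'G:\Backups\master_full.bak'
WITH REPLACE;
GO
-- once this completes, the instance shuts itself down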

A very convenient side effect of running from the command line with different startup parameters was that I hadn't actually made any configuration changes elsewhere, so all I needed to do now was bring the Cluster Group's SQL services back online.
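Bringing things back was simply the reverse of taking them offline earlier (again, the resource names are illustrative):

cluster resource "SQL Server (INST1)" /online
cluster resource "SQL Server Agent (INST1)" /online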

(screenshot: restoring the master database)

So what happened you might ask?

Yes, everything came back online fine; I connected to the instance very quickly via SSMS and all the problems had disappeared. The next step was to take a fresh backup of all the system databases, and once that was complete I verified the integrity of each and every system and user database. Thankfully all of these checked out fine, so the last step was to make sure a FULL backup of each and every database was taken and put to one side.
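The tidy-up amounted to the usual suspects, along these lines for each database in turn (paths and names are examples only):

-- verify integrity
DBCC CHECKDB (N'master') WITH NO_INFOMSGS;
GO
-- then a fresh FULL backup, put to one side
BACKUP DATABASE master
TO DISK = N'G:\Backups\PostFix\master_full.bak'
WITH INIT, CHECKSUM;
GO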

In conclusion, when you are performing database administration on a Clustered Instance, how you do it and what you do depends upon what the Cluster configuration allows. SQL Clusters are much harder to troubleshoot than standalone instances, and you should keep reminding yourself that just because you know how to fix a problem on one does not necessarily mean you will be able to do the same thing on the other.

A week at Oracle University

About a month ago I went into the Big Smoke to spend the week at the Oracle University premises at 1 South Place. I was attending the 11g R1 Oracle RAC course, and since I am very knowledgeable about high availability solutions I was especially looking forward to filling in the last piece of the jigsaw. I had already been on Oracle's 10g Data Guard course -which, for the benefit of you SQL boffins, is roughly the equivalent of SQL Server mirroring. RAC is an acronym for Real Application Clusters, and you can think of it as being similar to SQL Clustering, but it differs greatly in that an Oracle instance is (or can be) effectively scaled out with the addition of extra nodes, and with the extra processing and memory available it can theoretically be scaled up too.
Rumor has it that something similar is being worked on for the next major release of SQL Server, but I am really not sure whether this is fact or fiction. One thing I really liked about RAC is that when it works you can imagine it being really special; however, like all my other exposure to Oracle, I was again left with a sense of bemusement at all the warnings we received: don't click this, don't type that, otherwise catastrophe will strike. The classic example in the notes was the warning, when uninstalling a node from the cluster, to ensure that you do not miss the last part of the command. I think the following was the offending command:

./runInstaller -updateNodeList ORACLE_HOME= "CLUSTER_NODES=" -local

One wrong move and every node in your RAC cluster would have the instance uninstalled. Now, imagine being in a large investment bank with a RAC cluster consisting of 20 nodes, needing for whatever reason to remove the instance from one node... and then boom, you make that fatal mistake. You would almost certainly be marched from your desk to the front door.

This is why, I believe, in most companies where production is critical to operations, those servers are probably never touched. For instance, a friend of mine who works for a financial investment fund recently told me that their Oracle server never ever receives promotions: everything is currently working fine and so nothing is ever changed. I am sure this is partly out of fear of changing something for the worse.

Moving back to the SQL world, I could never see an identical situation occurring. A similar one, yes, but not to the point where no improvements are ever made for fear of breaking something. I think this is partly down to confidence that SQL Server really doesn't hold any nasty gotchas -certainly not to the same degree that Oracle does. An Oracle DBA I used to work with regularly told me horror stories of disasters that were simply down to the DBAs trusting in the product and its patches, and ultimately being marched off the premises -scary stuff!

Anyway, a couple of things really made me giggle on Oracle's premises. All our RAC nodes ran Linux -which is absolutely fine, although it wasn't apparent at the time whether it was OEL or not. The thing that amused me was that all the client machines were running Windows 2000! Yes, you heard that right: not Linux, not even XP or Windows 7, but Windows 2000. For a company with such a dislike of another company, not only to be using that company's desktop OS but a ten-year-old version of it makes them look rather foolish in my opinion.

Another bizarre thing was the Open Access PCs that give guests access to the Internet. Firstly they were (again) Windows, and secondly, and most importantly, they did not have Java installed, rendering them fairly useless to me for the Java-based web pages I needed to look at. And what is the most amusing thing here, you might ask? Well, who now owns Java?!

Returning to the course itself, I have to say I enjoyed it in varying degrees. Far too much talking and not enough playing/breaking, though -but that's not the trainer's fault, he was just following the coursework. The RAC cluster assigned to my partner and me worked fine for three days, but then one evening, without us changing anything... it failed. One node went down completely and the other went into limbo! From the error messages being given it looked to me like disk corruption, but nothing conclusive. The instructor wasn't sure either and promised a repair and reboot overnight, but upon arriving the next day he said that it had strangely come back on its own. My confidence in the product was not high by this stage.

Oh yes, have I mentioned that Oracle University courses no longer use printed courseware? That is all very commendable, you might say, but think about the first thing people will do when they get back to work. It will usually mean more paper usage, since Oracle booklets used to be A5 in size, which is not your standard printer fare, is it? So that's double the paper usage per person per course. The only reason I can see for this decision is to squeeze out even more savings for their profits. It's all a bit silly really; not one person on the course was happy about it.

Now that I think about it, one more thing I found utterly ludicrous was that once the course had started, our instructor explained to us the difference between the 11g R1 RAC course and the 11g R2 one. Essentially it boils down to the fact that the differences between R1 and R2 are so great that you could say 11g R1 is more like the 10g RAC course, and R2 is an entirely different product! You really couldn't make it up, could you?! If you compare this to SQL 2008 and the R2 release (or indeed any major SQL release since and including version 7), I think that if you took training in any one major feature, the skills learned would be to a large degree transferable to the next version's feature. The only exception I can think of off the top of my head is Notification Services, which was new in SQL 2005 and scrapped in 2008. DTS and Integration Services are not the same thing, and in any event we are talking about a five-year release gap between those features.

Look Oracle, I'm open to you and your products (in fact I use some of the non-database ones), but I have one thing to say to you, and I hope you go away and have a good long think about it... you could and should do better.

SQL 2008 MSDTC cluster problems, when coordinators collide

Well, today I've just finished our (dual) production failover cluster (almost). All that is left is to remove one of the clustered drives and carve it into 5 LUNs. The reason is that in SQL 2008/Windows 2008 it is now very easy to run the MSDTC resource within the clustered groups, and in fact it is recommended, but unfortunately drives weren't made available for this. Why not, you might ask? Well, in a Windows 2003 and SQL 2005 cluster, setting up the MSDTC was a bit of a pain. However, it wasn't too hard to get working using the Quorum drive as a dependency, so that is what we did -the single MSDTC is then shared across all cluster groups. Windows/SQL 2008 clustering won't let you do this (use the Quorum) for obvious good-practice reasons, so that option is not open to you. But since doing it properly is no longer technically difficult, it is better we try to do it properly! The reason I can't run the MSDTC on one of my existing clustered drives is that I can't get that working with the dual-cluster integration (which over the next days or weeks I hope to document in this blog). Therefore, like the Quorum drive, I am going to accommodate this by having a separate non-mirrored (SAN MirrorView) disk resource within each clustered group that can house the MSDTC.
The fifth LUN is obviously so I can have the same (although ever so slightly smaller) disk resource to put back into one of the cluster groups, replacing the LUN that was taken.
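For what it's worth, once a dedicated disk is in place the DTC resource itself can be created either through Failover Cluster Management or from the command line. The sketch below is the cluster.exe form with made-up group, disk and network name resources, so treat it as a rough guide rather than gospel:

REM create the DTC resource in the SQL cluster group
cluster resource "MSDTC (INST1)" /create /group:"SQL Group 1" /type:"Distributed Transaction Coordinator"

REM it needs its own disk and the group's network name as dependencies
cluster resource "MSDTC (INST1)" /adddep:"MSDTC Disk 1"
cluster resource "MSDTC (INST1)" /adddep:"SQL Network Name (SQLNET01)"

cluster resource "MSDTC (INST1)" /online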

The reason this MSDTC doesn't seem to work is no doubt related in some way to the cluster group name's SID/account, since once DR failover has taken place that is all that has really changed. Needless to say, the clustered MSDTC resource fails if the original one is used, so due to time constraints I have had to compromise and move on.

Hopefully, in the next post I'll document and post my screenshots on creating a cluster, and then I'll follow that up with how I do the dual-cluster DR using active/active on all nodes. I think you are going to like it.

Observations with SQL 2008 Clustering, simplifying the installer

Well, these last few days I've been really in the thick of it, trying to sort out clustering on two new UAT servers running Windows 2008 and get SQL 2008 set up in an active/active configuration. It's been a real struggle for a couple of reasons: the first is that I was only given a couple of days to learn, implement and configure due to significant overruns by the Technical Support team, and the second is how much has changed in this latest release of clustering.

The most significant difference from a Windows perspective is the simplification of cluster setup. In fact, I almost found it too easy and straightforward, to the point where it started to frustrate. The prime example is the way the wizard automatically assumes what you want and selects it without even providing a pre-install opportunity to change it. The case in point is the Quorum drive: not only did the wizard decide the "most suitable drive" behind the scenes and not give me the chance to change it, it also selected the quorum model I should be using. Post-install it is very easy to change, and surprisingly it seems to work well, but for a relatively expert user of clustering this is not a welcome change. The previous incarnation of clustering auto-selected a Quorum but gave you the chance to change it during setup.

Within the SQL Server setup the installs were fairly simple, but one change that is both good and bad in equal measure is the node-by-node nature of the installer. To set up the failover instance you must first install on the first node through the "Create Cluster" option, and then, once that is set up, add the other node(s) through the "Add Node" option. I think this actually works quite well; it allows for higher availability of the cluster during the install and is probably less prone to error -in particular I am thinking of the SQL 2005 cluster install problem where a multi-node install fails because someone is logged onto more than one of the nodes during the install!
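If you prefer the command line, the same two steps map onto setup actions along the lines shown below. This is a cut-down sketch only: the real parameter list (service accounts, IP addresses, disks and so on) is longer, and all the values here are placeholders.

REM on the first node - create the failover cluster instance
setup.exe /ACTION=InstallFailoverCluster /FEATURES=SQLENGINE /INSTANCENAME=INST1 /FAILOVERCLUSTERNETWORKNAME=SQLCLUSTER01

REM on each additional node - join it to the instance
setup.exe /ACTION=AddNode /INSTANCENAME=INST1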

The Service Pack installation appears to have been made ever so slightly easier by this approach too, and I have to admit that on previous cluster versions my heart would skip a few beats when Service Packing the SQL cluster, because the install ran against all the nodes at once and was prone to failure.

All in all, the changes are not too bad (in fact so far a slight improvement) -which unfortunately is not what I would say about Windows 2008. It’s simply ghastly.

Future SQL 2008 Cluster installation, fooling a cluster

Very soon I shall be embarking upon the build of our latest database utility SQL 2008 Cluster. Hopefully I will get the time to document bits as I go, because it should be rather special if it is anything like the previous SQL 2005 clusters I have been involved in installing.

Let me explain. You have just installed a SQL cluster and are now wondering how on earth you are going to provide disaster recovery. There are many options open to you, but you decide that you still want your DR solution to be clustered AND you don't want the SQL instance names to change once failover has been initiated. Secondly, you wonder how you are going to ship the databases across to the secondary cluster, thinking about mirroring, replication and log shipping technologies. Each of these has its own problems and implications, and in addition you want all your hardware to be utilized rather than sitting idle.

The solution that I have put in before, and will soon be repeating, involves two physically separate Windows SQL clusters, each using SAN-based disk mirroring, with the ability for either cluster to take ownership of and run every SQL instance that exists between the two clusters.
The way this is achieved is by fooling each cluster into believing it has each and every SQL instance installed to it, which works very well because of the nature and workings of virtual clustered instances. Effectively, only the virtual clustered instance(s) whose SAN disk resources are visible can be brought online, and of course only the SAN disk resources that are the source disks for the mirroring will be visible -so it all works seamlessly. When I come to do this again I will try to get it documented and explained in more detail.