IT Certification and Training

August 24, 2007

Lost Art: Root Cause Analysis

Filed under: education, lostart — Jim Henderson @ 21:43

One of the things I see as a lost art is identifying the root cause of a problem in an IT infrastructure. In fact, this is becoming a lost art in more than just IT – the medical field (in my limited experience) also suffers from this.

A few years back, I decided I needed to do something about what appeared to me to be an allergy problem. So, we went to our local medical center, and the answer I was given by the doctor was “here, try these pills, if they don’t work, try these pills, and if they don’t work, try these pills and this nasal spray together. Good luck!” I didn’t even get a referral to someone more qualified to perform a root cause analysis to determine why I felt so miserable.

A lot of IT troubleshooting is handled in this way as well. Some of it is a necessity, but I’m of the opinion that trying different solutions is the last thing you should do, not the first thing. To be successful in IT, you have to understand the systems well enough to fix the problems that happen so they don’t happen again.

It seems that many people coming into IT are not taking the time or making the effort to understand the systems they implement. This is something that should be taught as part of a certification program, but many cert programs don’t go into this.

When I first started teaching for Novell back in 2003, I was hired to teach the eDirectory Advanced Technical Training course. Before I taught that class, though, I had to learn a new course that had just been released, course 3007: eDirectory Tools & Diagnostics.

This course, I think, is one of the best certification courses that I’ve ever seen. I might be slightly biased, because it was the first course I taught – and was a course that taught something that I’d been doing for many, many years at that point: How to find the root cause of a problem and diagnose it.

So, just like the doctor I saw years ago, many IT professionals learn how to try several things until one of them works. In many cases, the “magic bullet” is to reboot the system and wait for the problem to happen again.

“Troubleshooting” in this way requires very little thought, but it ultimately defeats the purpose of hiring a professional in this field. Being professional, in my mind, means having a brain and knowing how to use it.

Now, why do I refer to troubleshooting as a “lost art”? Certainly, there is a fair amount of science (or application of scientific methods) involved in troubleshooting. Narrowing the problem down to a specific cause is a very methodical process, and done right, the same process can be used consistently and return good results.
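To make the "methodical process" concrete, here is a minimal sketch of one common narrowing technique: bisecting an ordered list of changes to find the first one that broke the system. This is my illustration and not something from course 3007. The function name `first_bad_change` and the `is_healthy_after` callback are hypothetical, and the sketch assumes the system was healthy before the first change, is broken after the last one, and stays broken once it breaks.

```python
from typing import Callable, Sequence


def first_bad_change(
    changes: Sequence[str],
    is_healthy_after: Callable[[int], bool],
) -> int:
    """Return the index of the first change after which the system is unhealthy.

    Assumptions (hypothetical, for illustration only):
    - the system was healthy before changes[0] was applied,
    - it is unhealthy after the last change,
    - once broken, it stays broken for every later change.
    """
    lo, hi = 0, len(changes) - 1  # the culprit is somewhere in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if is_healthy_after(mid):
            lo = mid + 1  # still healthy here, so the culprit came later
        else:
            hi = mid  # already broken here, so the culprit is mid or earlier
    return lo


# Usage: five changes were applied in order, and the fourth one broke things.
applied = ["patch-a", "patch-b", "config-c", "update-d", "patch-e"]
culprit = first_bad_change(applied, lambda i: i < 3)
```

Each test halves the list of suspects, which is the opposite of trying fixes at random until one of them sticks.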

Following the medical analogy I started with, though, there’s a bit of science involved in a diagnosis, but there’s also the application of knowledge that sometimes happens in a nonintuitive way (at least to the untrained observer) to identify what the actual root cause is. A few weeks ago, I was watching an episode of some crime program, and the beginning was a guy walking into a veterinarian’s office with what seemed to be some sort of fatal wound. The veterinarian walked over and started by saying “I don’t treat people! I treat animals!”, but he tried to do what he could. When the police showed up, they asked him questions, and he gave very detailed answers about what was wrong with the guy – dislocated shoulder, ruptured spleen, and a few other things; all this information given even though the patient couldn’t talk to him. The cops looked at him funny, and he said “what can I say? I’m used to dealing with patients who can’t tell me what’s wrong – I have x-ray hands, I guess.”

That’s the kind of art I’m talking about: Being able to diagnose issues in a system when there’s no evidence of a problem other than, say, a complete failure of something. Using an example from my own IT career, I got a call one day from the help desk saying “users can’t log into a particular server”. I was connected to the server, so I looked at a drive mapped to the server and said “well, the server’s up, send someone out to look at the workstation.” After a while, the number of calls escalated, and they started coming in from other parts of the company: Clearly something was wrong, even though all these servers were up. It turned out that the authentication service on the server had deadlocked. That didn’t prevent access for those already logged in, but anyone trying to log in fresh couldn’t do anything.

Three days of troubleshooting ensued, and the cause turned out to be a bug in the code for an update we’d just applied. It was a one-in-a-billion chance, and we just got lucky: the right amount of load on the servers, the right number of users logging in, and the right distribution of services across the servers in the environment.
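The trap in that story was checking the server through a session that was already logged in. Here is a small sketch of the distinction, using a simulated server rather than any real Novell API. The `FakeServer` class, its methods, and the `diagnose` function are all hypothetical stand-ins that I made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """Hypothetical stand-in for a file server with an authentication service."""

    auth_deadlocked: bool = False
    sessions: set = field(default_factory=set)

    def login(self, user: str) -> bool:
        # A deadlocked authentication service cannot admit anyone new.
        if self.auth_deadlocked:
            return False
        self.sessions.add(user)
        return True

    def read_share(self, user: str) -> bool:
        # Sessions that already exist keep working regardless of auth state.
        return user in self.sessions


def diagnose(server: FakeServer, existing_user: str, probe_user: str) -> str:
    """Test both paths: an existing session and a fresh login."""
    existing_ok = server.read_share(existing_user)
    fresh_ok = server.login(probe_user)
    if fresh_ok:
        return "new logins succeed"
    if existing_ok:
        return "server up, but new logins fail"
    return "server appears down"


# Usage: the admin logs in, then the authentication service deadlocks.
server = FakeServer()
server.login("admin")
server.auth_deadlocked = True
result = diagnose(server, existing_user="admin", probe_user="probe")
```

Looking at my mapped drive was the equivalent of calling only `read_share`. A check that exercises the same path the affected users take would have pointed at authentication much sooner.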

Those familiar with Novell’s NDS product, and with my writings on Novell’s Cool Solutions and in the support forums, will know that my least favourite way of troubleshooting a directory problem is to start running repairs. Troubleshooting requires finesse, not brute force. This is true for any technology, not just directory technologies (though I might argue that for directory technologies, it’s more important not to use brute force to fix problems one doesn’t understand, because of the distributed nature of the technology). Take some time to understand the problem, particularly if it’s a serious problem. While your users will likely not thank you for not getting the system running immediately after a failure, if you can isolate the problem and fix it for good, they will appreciate that the system isn’t going down again.

This isn’t to say that a band-aid solution is never appropriate: Sometimes, you’ve just got to get the thing back up and running as quickly as possible. The financials server has crashed, and the quarterly sales figures are due today. There isn’t much choice in that situation: if rebooting gets the system running so the finance people can get their work done (which might be work that is required by, say, the Securities and Exchange Commission in the US), then you reboot.

Just don’t leave the band-aid on. If you’ve ever left a bandage over a wound for longer than you should have, you’ll understand what I’m saying here.

As someone now involved in Novell’s certification programs, I’m happy to say that troubleshooting is getting some traction in the certification paths. Troubleshooting and optimization (the latter of which I may write about in a later post) are two key points of discussion as we design the new paths. It simply isn’t enough to be able to install and manage the system; you’ve got to be able to fix it when it’s broken.

That’s one of the things they pay you the “big bucks” for.

It’s just like in the doctor’s office: a GP might tell you to try a handful of different drugs in order to treat the symptoms, but a specialist (such as the very good allergist I saw about a month ago) is going to actually do some diagnostic work and determine the actual root cause before prescribing a treatment, and that treatment is going to be targeted at that root cause rather than the symptoms. The specialist may not be able to give you good news all the time (in my case, a course of 3-5 years of allergy shots was the resulting prescription), but the chances of the problem being solved for good are much better.
