blog.dinogane.com: operations

Why incidents occur. What we are doing to prevent them.

I've come across an interesting video that highlights in the simplest terms possible why incidents occur: because a change has been made to the system.

The interesting segment is from the time 2:20 onwards:

The key parts of the video are where they mention:

Change is what leads to problems and incidents. The vast majority of incidents are due to someone making a change, mistakenly believing that the change is not going to lead to an outage.

Being intelligent about change will make a big difference to IT.

25% of problems come from infrastructure... the vast majority of the rest comes from changes that are wrong that have been made with the best of intentions.

The speaker goes on to tell a story about 3 programmers working together producing punchcards for a system. Starting from the punchcards they already had, they made very careful, well-thought-out changes. Even so, things still went wrong.

We have been experiencing a number of problems in the environments managed by the Operations team I lead. These environments are a series of components, software applications as well as hardware, each with particular inputs and outputs. The components are connected to each other in various ways and configured through configuration files. To get a particular service playing through the TV, an engineer needs to hand-craft the configuration of this series of components, ensuring that the inputs, outputs and configuration of each component are correct. This is much like the story of the punchcard system described in the video.

The problems that we experience on our environments are almost certainly due to slight misconfigurations that engineers sometimes make. As a consequence, we have started looking at automating these changes. If we are successful, we will be able to use a web interface to make the changes that are normally made by hand. We should then be able to expect the changes made by this automated system to be accurate and correct every time. This in turn should dramatically reduce the number of problems caused by mistaken human changes.
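As a rough illustration of the kind of check such an automated system could run before applying a change, here is a minimal sketch in Python. It is not our actual tooling: the `Component` class, its field names and the format labels are all hypothetical, and it assumes the components form a simple chain in which each output must match the next component's input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One link in the chain. All field names here are hypothetical."""
    name: str
    input_format: str
    output_format: str


def find_mismatches(chain):
    """Describe every place where one component's output does not
    match the input expected by the next component in the chain."""
    problems = []
    for upstream, downstream in zip(chain, chain[1:]):
        if upstream.output_format != downstream.input_format:
            problems.append(
                f"{upstream.name} outputs {upstream.output_format} "
                f"but {downstream.name} expects {downstream.input_format}"
            )
    return problems


# A made-up chain for one service, with a deliberate mistake in the last link.
chain = [
    Component("encoder", "sdi", "mpeg-ts"),
    Component("multiplexer", "mpeg-ts", "mpeg-ts"),
    Component("modulator", "ip", "rf"),
]
problems = find_mismatches(chain)
for problem in problems:
    print(problem)
```

A check like this catches the slight mismatch before the change reaches the environment, rather than after a service has stopped playing.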

Getting the team talking through The Daily Scrum

Communication is key to a Service Desk - team members need to feel comfortable exchanging ideas about how to solve problems. However, in a flexitime working environment, the members of the team crawl into work one by one. Some start studiously looking through their email. Others put on their headphones as they get on with their work for the day. In an apathetic environment like this, the practice of holding a Daily Scrum meeting becomes the heartbeat that kicks off talk and chatter among the team for the rest of the day.

The Daily Scrum is a concept taken from the Scrum agile project management method, used most frequently for software development. There are plenty of articles on the Internet that describe how it works, as well as articles on how it does not work. In essence it is a daily stand-up meeting in which the team stands in a circle and each member in turn tells the team what they have been working on since the previous Daily Scrum, what they plan to work on until the next one, and any blockages preventing them from completing their work.

Unlike long-running projects, a Service Desk environment is based on incidents, so it is often difficult to predict what team members are going to be working on. However, the meeting acts as a useful way for members to synchronise with the rest of the team, giving everyone greater visibility of each other's work and providing an opportunity for them to help each other. Particularly when there are major incidents, the meeting keeps everyone focused and aware of progress.

A glimpse of a slick, professional team

I consider note taking to be a key behaviour of members of a world class service management desk. Note taking while investigating issues creates an audit trail that easily gives the engineer working on an issue, as well as others in the team, a trace of how something has been investigated. It allows others to be included on the investigation, allowing them to make contributions.

Without the key behaviour of note taking, the service management desk becomes prone to common problems that frustrate stakeholders of less well managed service desks:

Lack of visibility of issues raised by end users
Engineers progressing issues in isolation and difficulty in tracking the progress they have made on issues
Difficulty in different engineers picking up and progressing issues worked on by other engineers
Difficulty in peer reviewing, or retrospectively reviewing, the work done by an engineer
Over-reliance on specific engineers for specific tasks

I am currently building a service desk team, with many members of the team inherited from elsewhere. Adoption of the note-taking practice has been slow, but today I've started to see a glimpse of the kind of slick, professional team we are working towards: able to pass issues between engineers easily, with clear visibility to anyone interested of the technical investigation done, and confidence in the capability of the team rather than individuals.

Specifically, the glimpse that I "saw" was this:

Engineer 1 completed some work to build a new server to a very particular specification. He recorded the details of his investigation on the ticket that was raised, #12. At first glance, the notes on the ticket seem excessive, as though not much thought had gone into them. In practice notes are almost never excessive, and the key is that notes are always there, not necessarily their quality. The build of the server took almost 3 weeks to complete, fitted in around Engineer 1's other work.
Recently, almost 3 months later, a similar request, #354, came in for a machine of the same specification to be built. In the past, the engineer picking up the issue would have had to reinvestigate and work out again how to build such a machine. In fact, the task might well have fallen to the same engineer who had worked on the original issue, as that engineer might remember some of the details of what they had done 3 months earlier.
However, because there are sufficient details on #12, a new engineer (Engineer 2) was able to pick up the new ticket, #354, and complete the work for the new server. I'm sure he sought clarification from Engineer 1 on some things, but there is enough in #12 to work confidently on this similar issue on his own. He was also able to complete the work for ticket #354 more quickly than #12 had taken: days rather than weeks. This is because he did not have to repeat any of the investigation done for #12.
This alone I thought was a great improvement in working practices... but it gets better! Engineer 2 was away today and a further request was made on #354 by the user who logged the issue. In the past, this might have had to wait for Engineer 2 to return to work because no one would have been quite sure of what had been done. However, Engineer 2 had also made notes on #354 as he progressed the issue, meaning a third engineer, Engineer 3, could respond to the request and progress the issue further.

We still have some way to go before stories like this are true of every incident that we deal with. However, I think it is encouraging that we are now starting to see the note-taking behaviour being adopted, along with its benefits.

Customer Service Reviews

When I was a Support Engineer, my manager would make a point of visiting as many customers as possible to find out what they thought of the department's service. He would not just visit the senior managers at the customer; more importantly, he would want to talk to the people who used the service of the Support Department on an everyday basis. At the time, I would think to myself, "Are these visits necessary? Surely it is obvious whether you are providing a good service or not". Later on, when I managed the department, it became apparent that I sometimes did not have sufficient visibility of the pain experienced by the customers, or even of how and why they were using the service of the Support Department in a particular way.

I'm five months into my current role in a new company and I am realising that these kinds of "Service Reviews" are more important than ever. It is easy to drop into a false sense of security, because the "customers" for this Service Desk are internal. I speak to the managers and team leads of these "customers" regularly, but I am only now realising that I don't get from them the full detailed picture or understanding of the customer's requirements.

The elements I am covering in these reviews include:

How the customers find dealing with the Service Desk. Some of the typical feedback I have got includes things like "not being given estimates of how long things will take; this means the customer is not sure whether to get on with other work".
Frustrations the customers are finding with the applications/services supported by the Service Desk. Feedback in this area includes issues relating to the instability of the applications and certain recurring incidents in the infrastructure.

The feedback from these areas feeds into the continual improvement of the Service Desk. Ultimately, it does not matter how elegant the infrastructure is or how well things work - the real measure of your success is dependent entirely on what your customers think of you.

Giving customers visibility of issue progression - Skype example

A week ago there was a massive outage at Skype - none of their 9 million users could use their service for 2+ days. You can imagine that if Skype is one of the central ways in which you speak with your friends, you would have been very frustrated - the same frustration you would feel if your mobile phone network were out for a few days.

What is interesting is that they used their blog to keep their users updated on progress: http://heartbeat.skype.com/

If you look at the entries for the month at
http://heartbeat.skype.com/2007/08/
you can see the entries they made throughout the incident to keep their users posted on what was happening. I've copied edited-down snippets here, and I really recommend going through these updates pretending to be one of the frustrated Skype users wanting their service working. I have further comments below.

Problems with Skype login
By Joosep on August 16, 2007.
UPDATED 14:02 GMT: Some of you may be having problems logging into Skype. Our engineering team has determined that it's a software issue. We expect this to be resolved within 12 to 24 hours...

Thanks for your support
By Villu Arak on August 16, 2007.
We'd like to thank everyone who has taken the time to send us their thoughts...

The latest on the Skype sign-on issue
By Villu Arak on August 16, 2007.
... we wanted to dispel some of the concerns ... The Skype system has not crashed or been victim of a cyber attack...

Further on the sign-on issue
By Villu Arak on August 17, 2007.
...We feel that we are on the right track to bring back services to normal. (Updated at 2:15am GMT)

Where we are at 0400 GMT
By Sten on August 17, 2007.
...We're fixing issues in our networking software and monitoring the clients getting online with increased success...

Looking slightly better at 0700 GMT
By Sten on August 17, 2007.
...even though it is too early to call out anything definite yet we are now seeing signs of improvement in our sign-on performance...

Where we are at 1100 GMT
By Villu Arak on August 17, 2007.
...We're on the road to recovery. Skype is stabilizing... Neither Wednesday's planned maintenance of our web-based payment services nor any form of attack was related to the current sign-on issues in any way.

Update at midnight GMT
By Villu Arak on August 18, 2007.
...Skype presence and chat may still take a few more hours to be fully operational....

The words we've all been waiting for
By Villu Arak on August 18, 2007.
Take a deep breath. Skype is back to normal.

What happened on August 16
By Villu Arak on August 20, 2007.
...The disruption was triggered by a massive restart of our users' computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update...

Now, as you were waiting for your Skype to start working again... How did reading those updates make you feel (compared to not having anywhere to look to see what was going on)? What if they had been updating the blog every few minutes as they worked on the problem, rather than every few hours, and had added technical detail which you may or may not understand - would that have made you feel more or less happy that the problem was being investigated and resolved? Then compare that with the service from, for example, a bureaucratic government organisation, or even your lawyer during the process of buying/selling a house. There is no place to go and see what is happening with your issue, and you feel you are banging your head against the wall - constantly chasing for updates through phone calls or other means.

This Skype example gives a glimpse of what is possible through using a ticketing/bug tracking system when engineers working on the problem update those tickets with notes. The dramatic increase in visibility of an issue being progressed gives greater confidence to customers, reducing their anxiety.
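To make the idea concrete, here is a minimal sketch in Python of a ticket that carries timestamped notes and can print them as a progress report, much like the Skype blog entries above. It does not represent any particular ticketing system: the `Ticket` and `Note` classes, the ticket number, the engineer names and the note text are all made up for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Note:
    author: str
    text: str
    written_at: datetime


@dataclass
class Ticket:
    """A hypothetical ticket; real ticketing systems will differ."""
    number: int
    summary: str
    notes: list = field(default_factory=list)

    def add_note(self, author, text, written_at=None):
        # Default to the current UTC time when no timestamp is given.
        when = written_at or datetime.now(timezone.utc)
        self.notes.append(Note(author, text, when))

    def progress_report(self):
        """Return the notes in time order, one line per note."""
        lines = [f"#{self.number}: {self.summary}"]
        for note in sorted(self.notes, key=lambda n: n.written_at):
            stamp = note.written_at.strftime("%Y-%m-%d %H:%M")
            lines.append(f"  [{stamp}] {note.author}: {note.text}")
        return "\n".join(lines)


# Made-up example data.
ticket = Ticket(1001, "Example request")
ticket.add_note("Engineer A", "Started investigating",
                datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc))
ticket.add_note("Engineer B", "Picked up while Engineer A is away",
                datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc))
report = ticket.progress_report()
print(report)
```

Anyone with access to the ticket, whether a customer or another engineer, can read the report and see how far the issue has progressed without chasing the person who last worked on it.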
