Why incidents occur. What we are doing to prevent them.

I've come across an interesting video that highlights in the simplest terms possible why incidents occur: because a change has been made to the system.

The relevant segment starts at 2:20:



The key points from the video are where they mention:

Change is what leads to problems and incidents. The vast majority of incidents are due to someone making a change, mistakenly believing that the change is not going to lead to an outage.

Being intelligent about change will make a big difference to IT.

25% of problems come from infrastructure... the vast majority of the rest comes from changes that are wrong that have been made with the best of intentions.

The speaker goes on to tell a story about three programmers working together to produce punchcards for a system. Starting from punchcards they already had, they made very careful, well-thought-out changes. Even so, things still went wrong.


We have been experiencing a number of problems in the environments managed by the Operations team I lead. Each environment is a series of components, both software applications and hardware, each with particular inputs and outputs. These components are connected to each other in various ways and are set up through configuration files. To get a particular service playing through the TV, an engineer needs to hand-craft the configuration of this series of components, ensuring that the inputs, outputs and configuration of each component are correct. This is much like the punchcard system described in the video.

The problems we experience in our environments are almost certainly due to slight misconfigurations occasionally made by engineers. As a consequence, we have started looking at automating these changes. If we are successful, we will be able to use a web interface to make the changes that are currently made by hand. We should then be able to expect the changes made by the automated system to be applied consistently and correctly every time. This in turn should dramatically reduce the number of problems caused by mistaken manual changes.
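To make the idea a little more concrete, here is a minimal sketch of the kind of check an automated system could run before applying a change: it walks a chain of components and confirms that each component's output matches the next component's input. Everything in it is an assumption made for illustration (the `Component` class, the `validate_chain` function, the component names and the format labels); it is not a description of our actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One element in a service chain (hypothetical model, for illustration only)."""
    name: str
    input_format: str
    output_format: str


def validate_chain(chain):
    """Return a list of readable errors; an empty list means the chain is consistent."""
    if not chain:
        return ["chain is empty"]
    errors = []
    # Compare every component with the one directly downstream of it.
    for upstream, downstream in zip(chain, chain[1:]):
        if upstream.output_format != downstream.input_format:
            errors.append(
                f"{upstream.name} outputs {upstream.output_format!r} "
                f"but {downstream.name} expects {downstream.input_format!r}"
            )
    return errors


# Hypothetical chains: the names and formats are made up for this example.
good_chain = [
    Component("receiver", "satellite", "transport-stream"),
    Component("encoder", "transport-stream", "ip-multicast"),
    Component("set-top-box", "ip-multicast", "hdmi"),
]
bad_chain = [
    Component("receiver", "satellite", "transport-stream"),
    Component("encoder", "sdi", "ip-multicast"),  # input does not match the receiver
    Component("set-top-box", "ip-multicast", "hdmi"),
]
```

Calling `validate_chain(good_chain)` returns an empty list, while `validate_chain(bad_chain)` returns a single message naming the receiver and the encoder. A check like this would not catch every mistake, but it is the sort of slip that is easy to make by hand and cheap to detect automatically.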