Better Performance is Cheaper
Which is more expensive?
1. Having calls solved within 8 hours on average
2. Having calls solved within 16 hours on average
Seen from the traditional market perspective, the logical answer is 1. In fact, it is not possible to say. Depending on how the IT service provider is organized, it may be cheaper for both customer and supplier to provide a higher level of performance.
IT service providers would have us believe that 1 is more expensive than 2. In general, an IT service provider will charge a premium price for a “higher” level of performance. However, all but the most extreme levels of performance do not warrant a higher price. And “all” covers most of the Service Level Agreements that have been concluded.
“Expensive” obviously has two sides: the price for the customer and/or the cost to deliver the service. And, of course, “expensive” is a matter of perception.
Let’s take the situation at an average IT service provider (internal or external).
The cost level (and, as a result, the price level) of an IT supplier is principally related to the number of people required to deliver a service. This includes overhead and other such cost components. To illustrate the principle, let’s look at an average day’s worth of incidents. The figure below depicts the traditional double-peak incident graph. Incidents tend to be discovered when people log on (in the morning or after lunch).
The question is: how can the IT organization ensure that the agreed service levels are met? The answer is: by allocating enough employees to incident-solving. In the above graph, a number of employees has been allocated to incident-solving. In order to deliver a higher level of performance, the traditional response is to add people to the incident-solving capacity. The essence of the next graph is that the amount of “curve” above the line becomes smaller, which means that fewer calls end up having to wait for resolution (in comparison to the graph above).
The key aspect is that the additional personnel required to raise the line is relatively limited. Where higher Service Levels tend to command a mark-up of between 25 and 100% (occasionally more), the number of extra people required for a substantial performance improvement may be between 5 and 20%. Generally speaking, the rule is: the higher the thick dotted line, the shorter the average resolution time of the incident. This is basic Queuing Theory.
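The queuing argument above can be made concrete with a small sketch. It assumes an M/M/c queue (random arrivals, random handling times, c identical incident solvers) and uses the standard Erlang C formula. The function names and the figures in the usage note are hypothetical illustrations, not data from the case described in this article.

```python
from math import factorial


def erlang_c_wait_probability(servers: int, offered_load: float) -> float:
    """Probability that a new incident has to wait (Erlang C, M/M/c queue).

    offered_load = arrivals per hour * average handling time in hours.
    """
    if offered_load >= servers:
        return 1.0  # more work arrives than can be handled: everything waits
    utilisation = offered_load / servers
    waiting_term = offered_load ** servers / factorial(servers) / (1 - utilisation)
    no_wait_terms = sum(offered_load ** k / factorial(k) for k in range(servers))
    return waiting_term / (no_wait_terms + waiting_term)


def mean_wait_hours(servers: int, arrivals_per_hour: float, handling_hours: float) -> float:
    """Average time an incident waits before someone starts working on it."""
    offered_load = arrivals_per_hour * handling_hours
    if offered_load >= servers:
        return float("inf")  # the queue grows without bound
    p_wait = erlang_c_wait_probability(servers, offered_load)
    return p_wait * handling_hours / (servers - offered_load)


if __name__ == "__main__":
    # Assumed workload: 9.5 incidents per hour, 1 hour of work per incident.
    for staff in (10, 11, 12):
        print(staff, "solvers:", round(mean_wait_hours(staff, 9.5, 1.0), 2), "hours waiting")
```

Under these assumed figures, going from 10 to 12 solvers (20% more people) cuts the average waiting time by far more than 20%, which is the non-linear effect the paragraph describes.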
“Additional personnel” means higher cost, however limited. How does this match the claim that “better performance is cheaper”? The fact that incidents are solved more quickly means that employees will be able to process a larger number of incidents in a given time interval. If the processing “power” is greater than the workload, any backlog will be reduced substantially and eventually removed.
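The last point, that a backlog disappears once processing capacity exceeds the incoming workload, can be sketched with a very simple day-by-day model. The function name and all numbers are hypothetical, and the model assumes constant daily arrivals and capacity.

```python
def simulate_backlog(initial_backlog: int, daily_arrivals: int,
                     daily_capacity: int, days: int) -> list[int]:
    """Return the number of unsolved incidents at the end of each day."""
    backlog = initial_backlog
    history = []
    for _ in range(days):
        # New incidents arrive, the team solves as many as capacity allows.
        backlog = max(0, backlog + daily_arrivals - daily_capacity)
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Capacity equal to the workload: the backlog never shrinks.
    print(simulate_backlog(100, 50, 50, 12))
    # Capacity slightly above the workload: the backlog drains steadily.
    print(simulate_backlog(100, 50, 60, 12))
```

In this sketch a capacity surplus of 10 incidents per day clears a pile of 100 in 10 days, whereas a team sized exactly to the daily workload carries the pile indefinitely.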
Backlog is a phenomenon that dictates the quality of an IT service provider and the services it delivers. Where there are backlogs, there is automatically a higher level of coordination. Coordination is one of the most important sources of poor performance and excess cost. Managing work backlogs is a highly intensive activity, much more intensive than managing a limited number of extra employees.
The problem with work backlogs is that keeping track of tens (or even hundreds) of calls is a logistical and content nightmare. Backlog can be visualized as a pile of work. Customer dissatisfaction is proportional to the size of the pile. The problem with the pile is that work (incidents) falls onto it and incident solvers tend to take incidents off the top of the pile, i.e. a typical case of LIFO (Last In, First Out). The business effect of the backlog grows over time, since it is determined by both the size of the pile and how long an incident has been in it.
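To illustrate why age matters as well as size, the sketch below compares taking incidents off the top of the pile (LIFO) with taking the oldest first (FIFO). It is a deliberately simplified model with hypothetical names and numbers: arrivals and solving capacity are constant and every incident takes the same effort.

```python
from collections import deque


def oldest_open_age(days: int, arrivals_per_day: int, solved_per_day: int,
                    initial_pile: int, lifo: bool) -> int:
    """Age in days of the oldest incident still open after `days` days."""
    # Each entry is the day the incident arrived; oldest on the left.
    pile = deque([0] * initial_pile)
    for day in range(1, days + 1):
        pile.extend([day] * arrivals_per_day)
        for _ in range(min(solved_per_day, len(pile))):
            if lifo:
                pile.pop()      # take the newest incident off the top
            else:
                pile.popleft()  # take the oldest incident first
    return days - pile[0] if pile else 0


if __name__ == "__main__":
    # Same pile size in both cases (20 open incidents), very different ages.
    print("LIFO:", oldest_open_age(30, 10, 10, 20, lifo=True))
    print("FIFO:", oldest_open_age(30, 10, 10, 20, lifo=False))
```

In both runs the pile stays at 20 incidents, but under LIFO the original 20 are never touched and simply keep ageing, while under FIFO nothing stays open longer than a day.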
So, what is the effect of reducing backlog? The graph below shows an actual situation at a multinational organization with a shared IT services center.
The column represents the average total number of incidents open during the month. The purple part of the column represents the backlog. The backlog is defined as any call more than 3 days old. The light blue part of the column is the number of incidents less than 3 days old. The latter is a remarkably stable amount (between 80 and 90 from December 08 to April 09). During a project to improve the performance and reduce the costs of the shared IT services organization, a concerted effort was made to reduce the backlog. It was reduced from 483 (on December 1st, 2008) to 6 on May 22nd, 2009.
The effect was that, having removed the backlog, the existing IT population was able to solve the remaining incidents more quickly, thus reducing the total number of open incidents. It rapidly became apparent that the number of IT employees could be reduced while maintaining the level of performance. The number of employees associated with incidents was reduced from 14 to 12. These were external employees, thus saving close to $400k.
A similar effect can be achieved with standard changes, requests for information and other such customer requests. It does not, however, apply to non-standard changes (i.e. projects). For those, the improvement of the price-quality balance is approached differently.
Overall, the effect was that the cost of IT was reduced, leading to lower prices in the following year.
This example shows that the sizing of (part of) the IT organization is based on the number of incidents and on the amount of time spent working on solving incidents and not so much on when the incidents are solved. If the organization is sized correctly, the incidents of one day should be solved in one day. Working with priorities ensures that the incidents with the most business impact are solved first, but still all incidents of a given day are solved on that day.
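The sizing rule described above can be written down as simple arithmetic. The sketch below is one possible reading of it, and the function name, the productive-hours parameter and the example figures are assumptions for illustration only.

```python
import math


def required_staff(incidents_per_day: float,
                   avg_hours_worked_per_incident: float,
                   productive_hours_per_employee_per_day: float) -> int:
    """Number of incident solvers needed to clear one day's incidents in one day."""
    if productive_hours_per_employee_per_day <= 0:
        raise ValueError("productive hours per employee must be positive")
    workload_hours = incidents_per_day * avg_hours_worked_per_incident
    # Round up: a fraction of a person still has to be staffed.
    return math.ceil(workload_hours / productive_hours_per_employee_per_day)


if __name__ == "__main__":
    # Assumed figures: 85 incidents a day, 45 minutes of work each,
    # 6 productive hours per employee per day.
    print(required_staff(85, 0.75, 6.0))
```

Note that the agreed resolution time does not appear in the calculation: the head count follows from the volume of work, not from when the work is done.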
Managing Impact on the Business
The second part of the improvement is delivering a higher quality at the same (or lower) price. As we have seen, reducing the cost of IT through managing workloads is simple and effective. It is also very easy to explain and manage.
Reducing the workload of incidents has a double effect. On the one hand, it leads to a reduction in the cost of IT. On the other hand, it leads to considerable benefits for the business. A large proportion of incidents leads to a certain level of unavailability of systems for users. This unavailability reduces the ability of business users to do their work.
Consequently, by reducing the time that incidents are open, IT is ensuring that business users experience less unavailability (in terms of workable time). The number of disruptions may stay the same but their overall effect is significantly lower than if the workloads were at their previous level. This effect can be followed by measuring Lost Production Hours, a Key Performance Indicator made up of the number of incidents, the time to repair of those incidents, the average number of people affected and a dependency factor (how dependent is the user on their IT).
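As an illustration, Lost Production Hours could be computed along the following lines. The article only lists the ingredients of the KPI, so combining them as a simple product, and expressing the dependency factor as a value between 0 and 1, are assumptions made for this sketch. The example figures are hypothetical.

```python
def lost_production_hours(incidents: int,
                          avg_hours_to_repair: float,
                          avg_people_affected: float,
                          dependency_factor: float) -> float:
    """Estimate Lost Production Hours for a period (assumed formula: simple product).

    dependency_factor: assumed scale from 0.0 (users can work without IT)
    to 1.0 (users cannot work at all while the incident is open).
    """
    if not 0.0 <= dependency_factor <= 1.0:
        raise ValueError("dependency_factor must be between 0 and 1")
    return incidents * avg_hours_to_repair * avg_people_affected * dependency_factor


if __name__ == "__main__":
    # Same number of incidents, repaired in 16 hours versus 8 hours on average.
    print(lost_production_hours(100, 16, 3, 0.5))
    print(lost_production_hours(100, 8, 3, 0.5))
```

With the number of incidents unchanged, halving the average time to repair halves the lost hours in this model, which matches the point that the number of disruptions can stay the same while their overall effect falls.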
A similar effect is associated with standard changes. Customers also benefit from the quick delivery of standard changes. They are able to be more productive as a result of the delivery of the change, e.g. a prompt creation of a user profile or a rapid delivery of a new laptop can help business users become or remain productive.
As shown in this article, IT must focus on decreasing backlogs and ensuring workloads and workforce remain balanced. The key characteristics of an IT organization that works in this way are:
- A workload of incidents that is 1 day’s worth of incidents, of which the oldest is (max.) 3 days old;
- A similar workload is present for Standard Changes;
- A service desk with high 1st line resolution rate;
- IT employees who start working on incidents as soon as possible without regard for the limits agreed in the SLA, only with regard for the business impact (priority);
- Attention to Problem Management in order to remove structural deficiencies from the IT infrastructure and develop workarounds for rapid restoration of the service.
The result is a more proactive attitude from IT with a direct effect on customer satisfaction.
A word of warning to all business managers: Just because IT is becoming more efficient does not necessarily mean that the organization needs to be downsized. It is vital to keep an eye on the number of incidents and the average time worked on these incidents in order to size the IT organization. Reducing the size (and, thus, capability and flexibility) of the organization can cause a negative effect on the Lost Production Hours: a typical case of penny wise, pound foolish.
The key point, however, is that it is a major misconception that a high level of IT service comes with a high cost of IT! In fact, better performance is cheaper for everyone.