How to Quick-Start Proactive Problem Management

No matter how efficient your incident management process, you will always have more incidents than you can handle—until you get proactive about problem management.

Where do you start?

Part of our Making ITIL 4 Simple series.


Developing a mature problem management practice can have a transformational impact on your service desk and the broader IT group, because it greatly reduces unplanned work and avoids operational costs. In turn, it gives you back time and money which can be reallocated to improvement and innovation.

Problem management is your friend because it’s the #1 way to end the firefighting—the unplanned work that disrupts an IT person’s day. By eliminating a chunk of unplanned work (as much as 50% in just a few months of running a proactive problem management practice), you give IT people time back—so they have the bandwidth to push innovation projects forward. It’s a virtuous cycle.

Unfortunately, many organizations aren’t where they need to be with the problem management practice. Research from HDI indicates 61% of organizations are “doing” ITIL problem management. However, most of these are only doing reactive problem management, driven by frequent major incidents.

Why are organizations not where they need to be with problem management? Firefighting gets in the way. It’s a Catch-22 situation: a mature problem management practice ends the firefighting, but the constant firefighting prevents them from maturing their problem management practice. This is a vicious cycle.

So how can we shift from vicious cycle to virtuous cycle?

Be agile. Find the single biggest problem. Solve it. Use the time this releases to scale up your problem management practice a little and solve the next most painful problems. Start redirecting operational resources from the “daily grind” work to transformative project work. Keep going.

ITIL 4 guidance can help:

Start small (see ITIL 4: Start Where You Are and ITIL 4: Keep It Simple and Practical).
Focus on the problems causing the most damage (see ITIL 4: Focus on Value).
Use the time gained to iterate (see ITIL 4: Progress Iteratively with Feedback).


What is a Problem?

Before we dig into the detail, and what ITIL 4 has to say about problem management, let’s quickly recap to make sure we’re all on the same page with definitions. According to ITIL 4, a problem is:

A cause, or potential cause, of one or more incidents

ITIL 4 Foundation Volume, Page 130

The word “potential” appears because, in the context of modern, automated ITSM technology, a problem is the underlying root cause of zero or more incidents. We say zero (instead of one) because in a mature problem management environment (supported by mature event management), a problem can be automatically discovered and automatically resolved before a service consumer experiences an issue. Before an incident record is ever created.

This is the realm of AIOps/AITSM, where immediate detect-and-correct automations and predictive AI can be applied to automate much of IT’s daily ITSM/ITOM workload. You can find out more about this in our forthcoming AIOps and AITSM series. Subscribe to this blog to make sure you don’t miss these.

Now back to the issue of improving the ITIL problem management practice.

Why Does Problem Management Matter?

Where incident management is about speed, problem management is about quality: taking time to properly investigate the problem, identify and validate a root cause, propose a solution, apply it, test it, deploy it, document it, and so on. Problem management is a fastidious, detail-oriented, technical practice.

You could argue it’s about stopping incidents flowing into the service desk, but that’s an inside-out IT perspective, not a service consumer perspective. Problem management is about improving service quality for customers—preventing service disruptions which reduce employee productivity—that’s what is really important here.

Customers care more about always-on services than they do about the stress levels in your service desk. The focus should be on solving problems which are impacting the customer experience. Reducing the number of calls coming into the service desk is a secondary benefit.

Problem management matters because it’s a force multiplier. Incident management is focused on solving one incident. Problem management is about solving many. So, the impact of problem management on business productivity is greater—by an order of magnitude.

A Brief Guide to Problem Management

In ITIL 4, the problem management practice is made up of three sub-practices—Problem Identification, Problem Control, and Error Control—each of which has a small set of clear responsibilities. Proactive problem management isn’t as scary as people sometimes think.

This division can help you assign responsibilities in a way which spreads the load evenly across members of your team. That helps ensure your problem management practice is sustainable, and it avoids potential conflicts between tasks. The diagram below sets out the main activities in each of the problem management sub-practices:

How to Quick-Start Proactive Problem Management

Now that we’ve covered the basics, we’ve set the scene to talk about where you can start, and how you can gain some quick traction with ITIL problem management.

Start by Attacking Your Top-10 Incident Pains

An easy place to start is to pull a report of your top 10 most frequent incident types from your service desk/ITSM solution. In other words, find out which recurring issues are taking up the most of your service desk agents’ time.


Calls to the service desk follow a Pareto distribution, sometimes called an 80:20 chart (see the diagram below), meaning a disproportionate share of all calls comes from just a handful of causes, seen here as the “head” of the chart, marked in red. The remaining calls come from a much larger set of different causes: the “tail” of the chart, marked in blue.

[Figure: incident chart]

Problem management is all about understanding cause and effect. By pulling a report of the top-10 calls by volume, we can identify the “head” (marked in red). These ten issues are likely causing 30%-50% of all calls to the service desk.

This is powerful because it allows us to solve a large chunk of the total call volume coming in to the service desk by addressing just a handful of issues.
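As a rough illustration of that report, here is a minimal Python sketch that counts incidents by category and shows how much of the total call volume the “head” accounts for. The function name, the category labels, and the sample numbers are all made up for the example; in practice the list of categories would come from an export of your own service desk/ITSM solution.

```python
from collections import Counter


def top_incident_categories(categories, n=10):
    """Return (rows, head_share) for the n most frequent categories.

    Each row is (category, count, cumulative_share_of_all_calls).
    head_share is the fraction of all calls covered by those n categories.
    """
    counts = Counter(categories)
    total = sum(counts.values())
    rows = []
    running = 0
    for category, count in counts.most_common(n):
        running += count
        rows.append((category, count, running / total))
    head_share = running / total if total else 0.0
    return rows, head_share


# Hypothetical sample: one category label per logged incident.
sample = (
    ["password reset"] * 40
    + ["vpn"] * 25
    + ["printer"] * 15
    + ["email"] * 10
    + [f"one-off issue {i}" for i in range(10)]
)

rows, head_share = top_incident_categories(sample, n=3)
for category, count, cumulative in rows:
    print(f"{category:15} {count:4d} {cumulative:6.0%}")
print(f"Top {len(rows)} categories cover {head_share:.0%} of all calls")
```

The cumulative column is what makes the “head” visible: once it flattens out, you have moved into the tail.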

In many organizations, the biggest stack is caused by password reset requests (which can be easily solved by providing self-service password reset tools to end users via web and mobile, but let’s not get ahead of ourselves).

The “long tail” (marked in blue) represents a smaller number of calls than the head, but there will be hundreds or potentially thousands of underlying problems. Eliminating these requires far more effort for each call avoided. Focusing on the head gets you the biggest reduction in calls in the fastest time. Ignore the tail until you have dealt with the head.

There is a nuance here to be aware of. Is a report of your top 10 calls really business-focused, or is it more focused on reducing stress on the service desk? Solving these top 10 problems might reduce the volume of calls (or self-service loggings) coming in to the service desk, but what impact will solving them have on the service consumer experience? Are they the top priority for the customer?

Look again at your report. What are the business priorities of these incidents (as measured using your priority matrix)? Do the priorities match up against the call volumes? If not, reorganise them based on the business priorities to ensure you are fully aligned with business demand. Now go and talk to your business unit heads to validate this list before you get to work on doing something about it.
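One way to picture that reordering is the short Python sketch below. It assumes each item on your list carries a call volume and a business priority taken from your own priority matrix, with 1 as the most urgent. The class, the field names, and the sample data are hypothetical, not part of any ITSM product.

```python
from dataclasses import dataclass


@dataclass
class ProblemCandidate:
    category: str
    call_volume: int
    business_priority: int  # assumption: 1 = highest priority in your matrix


def rank_for_business(candidates):
    """Order by business priority first, then by call volume (highest first)."""
    return sorted(candidates, key=lambda c: (c.business_priority, -c.call_volume))


# Hypothetical top-of-report items, listed in call-volume order.
candidates = [
    ProblemCandidate("password reset", 400, 3),
    ProblemCandidate("printer offline", 250, 4),
    ProblemCandidate("vpn drops", 200, 2),
    ProblemCandidate("ERP login failure", 120, 1),
]

for item in rank_for_business(candidates):
    print(item.business_priority, item.call_volume, item.category)
```

In this made-up data the lowest-volume item ends up first, which is exactly the kind of mismatch worth raising with your business unit heads.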


They’ll really appreciate the chance to provide input on priorities, and they may even get excited about you closing off some annoying recurring issues which are hurting their productivity. This means they are more likely to provide valuable support if you need it (e.g. when you are asking other IT teams to help out with applying the relevant fixes). More on that later.

When you have this business-validated list, these problems will be the focus of planned work for your problem management practice.

This planned work should be the core of what your problem management practice does, but it can be overruled in the event of a major incident, one which is causing catastrophic loss to the organization. Major incidents are Priority 1 incidents which require an all-hands-on-deck response. The problem management team will park planned work and instantly divert to investigating the major incident.

But remember: with a mature problem management practice in place, the number of major incidents will fall over time, tipping the balance from unplanned to planned work. Once again, we’re back to the benefit of reducing unplanned work to make room for improvement and innovation.

[Figure: inverse relationship between problem management and unplanned work]

Proactive problem management is your friend because, although it doesn’t seem glamorous, it helps you accelerate innovation and execute the digital agenda.

Doing Something About It

So far we’ve looked at Problem Identification and the prioritisation element of Problem Control. Now you know what to focus on. It’s time to do something about root causes so you can systematically close off incidents permanently. We’re now going deeper into Problem Control, where things start to get more technical.

We need to investigate two things:

Is there a workaround that will restore the service quickly, sidestepping the need for detailed analysis and getting service users back online faster?
Where is the root cause? And why did a service interruption happen? This is the detailed analysis that needs to happen to close off the source of these incidents for good.

This is where a CMDB can accelerate you to both of these goals. If you have a complete and accurate digital view of your IT ecosystem—one which shows you both the relationships between service components and the status of those components—you can get an understanding of cause and effect more quickly (versus sending people to physically inspect hardware or manually parse system log files).
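To make the cause-and-effect idea concrete, here is a small Python sketch that treats CMDB relationships as a simple “depends on” map and looks for components that every affected service has in common. The map, the component names, and both function names are invented for illustration; a real CMDB would expose this data through its own query tools, and a shared dependency is only a candidate root cause, not a confirmed one.

```python
def upstream_components(cmdb, item):
    """Return every component that `item` depends on, directly or indirectly."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for dependency in cmdb.get(current, ()):
            if dependency not in seen:
                seen.add(dependency)
                stack.append(dependency)
    return seen


def shared_dependencies(cmdb, affected_services):
    """Components that all affected services rely on: candidate root causes."""
    dependency_sets = [upstream_components(cmdb, s) for s in affected_services]
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


# Hypothetical "depends on" relationships exported from a CMDB.
cmdb = {
    "payroll-app": ["app-server-1", "db-cluster-a"],
    "hr-portal": ["app-server-2", "db-cluster-a"],
    "app-server-1": ["switch-3"],
    "app-server-2": ["switch-4"],
    "db-cluster-a": ["storage-array-1"],
}

suspects = shared_dependencies(cmdb, ["payroll-app", "hr-portal"])
print(sorted(suspects))
```

Here the two affected services share only the database cluster and the storage behind it, so those are the first places to look.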

Once you have identified the root cause, you may need to create a change record to trigger the work that needs to happen to apply a permanent solution.

The challenge is that many of the actions that need to be taken to eliminate these flaws happen outside of the service desk; beyond the reach and power of the problem management team. Service desk agents don’t have direct access to the different technologies: privileged systems like databases, servers, network devices, storage, and cloud management. In large organizations there are teams dedicated to each of these types of system—and for the sake of stability and security they don’t share access. That means that problem management practitioners need to reach out to effect a change. That’s what a change ticket is—a formal request from one team to another team to take an action that they themselves cannot take.

Ultimately, problems are eliminated through this process. With each problem solved, the IT ecosystem is made more robust and another stack of calls to the service desk is avoided.


Don’t Stop Now

Once you’ve solved your top-10 pain points, you should now have taken some pressure off your service desk—perhaps enough already to reassign some of the more tech-savvy agents to your problem management practice to make even more progress, faster. Business stakeholders will have noticed a difference too—now that a significant number of disruptions to employee productivity have been avoided.

If you pull your incident-volumes-by-category report once again, it will probably look quite a bit like your original report, but with a shallower slope. You have “cut off the head” but not slain the beast. The “long tail” issues will still be popping up at the service desk again and again. There’s more to be done. With problem management there’s always more to be done.

Where to Start

Taking the first step towards improvement often means taking a step back from the firefighting. Something has to give way to make room for improvement. That means there might need to be a temporary drop in incident management performance to create the slack you need for a permanent uptick in performance. After that, you should have enough bandwidth to easily catch up with the backlog (now that the volume of firefighting is reduced) as well as some spare capacity to tackle more. From there, each problem you tackle reduces the volume of unplanned work a little bit more.

In the meantime, you need to communicate with your business stakeholders. They need to know what you’re planning and why. They need to know that they might need to take a temporary hit in order to help IT turn a corner which will ultimately enable a much faster pace of innovation in the long run. When they know what’s coming, they can plan a way around it.

Proactive problem management is your friend. It has the power to transform your support operations in just a few months. Proactive problem management activity makes time for more proactive problem management activity. Each step forward creates a ripple of positive effects across IT. It’s a virtuous cycle and it’s one of the keys to achieving high IT maturity.

To see ITIL 4 expert Troy DuMoulin and ITSM Solutions Consultant talk about how you can kick-start proactive problem management (and four other ways you can quickly apply ITIL 4 best practices), watch this on-demand webinar now.