June 15, 2018 (updated 18 Jun 2018 5:13pm)

IT incident management: the problem with problem solving

By John Rakowski

Einstein is often credited with saying that if he had one hour to save the world, he’d spend fifty-five minutes defining the problem and just five minutes finding the solution. For those who have sat in IT incident war rooms where Mean Time to Resolution rapidly becomes Mean Time to Innocence, fifty-five minutes might seem optimistic.

To the user, applications are simplifying. With a spoken command, our digital assistant or smart home device can play music, summon information on our daily schedule or provide us with ticket costs and weather conditions at a given travel destination.

But this simplicity for the user comes at a massive cost of application complexity. Today’s application architecture is highly dispersed, massively inter-dependent and needs to operate and update at breakneck speed. All this means finding a problem, and its cause, has become immensely challenging.

Every millisecond counts in IT incident management

The minutes lost in identifying a problem are more critical to business than ever before. When it comes to loading time, every millisecond counts.

Poor app performance can manifest itself in a number of other ways too. Errors such as being redirected to blank web pages, website outages and crashing apps are just a few of the main offenders that rile people.

Companies need better ways to identify problems or pre-empt downtime before they lose business or see loyalty erode. In this context, the old method of round-robin finger-pointing, with each team bringing disparate monitoring data to the table, must change.

In today’s complex app maintenance environment, a business-impacting problem might have one or multiple points of failure. The ability to resolve it might sit with siloed internal teams, service centres in different time zones or third-party cloud operators.

Finding these points of failure is the first step, but it will only become more challenging as the pace of change accelerates. So how can IT organisations not just cope, but thrive, in this complex environment?

Cross-team collaboration is vital to IT incident management

All of the businesses I work with are continually evolving their production environments or expanding their digital services – whether it’s through in-house innovation or acquisition.

For many, that means increased complexity and increased demands on IT. Often teams have multiple monitoring tools, either custom builds or simple analytic tools built into point systems. These siloed systems can’t offer a complete view and often show the effect of poor performance rather than the root cause.

When a problem occurs, these systems simply don’t offer the ability to understand and act on performance issues immediately. If IT are to spot and deal with incidents before they impact customers, they have to be able to see what’s happening across the entire stack, in real time.

Not only that, but performance monitoring must have the granularity and permissions to allow individual experts to drill down. This data must go all the way down to each individual user action, and the demands that action makes on the system, to pinpoint the root cause.

IT must work as one. Whether it’s development and operations or database and network, teams cannot afford to solve problems in silos. IT needs a performance monitoring system that provides the granularity needed by all teams, but brings those insights together in one place.

IT incident management: establish what constitutes a ‘problem’

Consider what the business cares about. In the energy industry the consequences of failure are, at the very least, massive inconvenience and expense, and at the most, potentially dangerous. In the ecommerce industry, slow load times could massively impact revenues across a peak sales weekend. For the banking industry, an outage at midnight on payday could be catastrophic for millions.

For every business there are different metrics, different times and different performance issues that constitute a code red IT incident. Understanding exactly what these are, when these are, and what they look like, is critical to resolution.

But the real game-changer when establishing these problems and communicating them back to the business is the ability for IT to talk in the context of business performance and transactions, not CPU spikes and speeds.

Having a system that can understand and correlate information around your users’ journeys and individual business metrics makes this far easier. Data can then reveal the real, business-impacting problems and enable IT to set a hierarchy of response and resolution.
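As a rough illustration of setting a hierarchy of response by business impact rather than by infrastructure symptoms, the sketch below ranks open incidents by the value of the transactions they are failing each minute. The `Incident` class, its fields and all the figures are hypothetical, chosen only for this example; a real system would draw these numbers from its own transaction monitoring.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """A hypothetical incident record, described in business terms."""
    name: str
    failed_transactions_per_min: float
    value_per_transaction: float  # assumed business value, e.g. in pounds

    @property
    def impact_per_min(self) -> float:
        # Estimated business value lost per minute while the incident is open.
        return self.failed_transactions_per_min * self.value_per_transaction


def prioritise(incidents):
    """Return incidents ordered from highest to lowest business impact."""
    return sorted(incidents, key=lambda i: i.impact_per_min, reverse=True)


# Illustrative data only: a noisy CPU alert that affects no transactions
# ranks below two issues that are actually costing the business money.
incidents = [
    Incident("CPU spike on reporting server", 0, 0.0),
    Incident("Checkout errors", 40, 55.0),
    Incident("Slow product search", 120, 4.0),
]
ranked = [i.name for i in prioritise(incidents)]
print(ranked)
```

Here the CPU spike, which might dominate a traditional infrastructure dashboard, falls to the bottom of the list because it is not affecting any transactions.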

Baseline performance: understand your rhythm

Understanding what your baseline of performance looks like is critical to defining what performance is problematic. Setting thresholds of acceptable performance will ensure that teams can understand any anomalies at a glance and can be alerted to business-critical issues in real time.

However, we all know that load can change dramatically minute-by-minute or day-by-day. A food delivery application may see more traffic on a Friday night than at any other point during the week.

A payroll application may experience higher load at the beginning and end of the month, compared to the rest of the month. Your performance monitoring system needs to know that this load doesn’t mean the sky is falling – or else you’ll be in for a few unnecessarily sleepless nights.
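One way to picture a baseline that "knows your rhythm" is to compare each reading only against history from the same weekday and hour. The sketch below is a minimal example under that assumption; the function names, the three-standard-deviation tolerance and the request counts are all illustrative, not a description of any particular monitoring product.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, stdev


def build_baseline(samples):
    """Group historical (timestamp, value) samples by (weekday, hour) slot.

    Returns a dict mapping each slot to (mean, standard deviation).
    Slots with fewer than two samples are skipped, since a spread
    cannot be estimated from a single point.
    """
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {
        slot: (mean(values), stdev(values))
        for slot, values in buckets.items()
        if len(values) >= 2
    }


def is_anomalous(baseline, ts, value, tolerance=3.0):
    """Flag a reading only if it is unusual for its own weekday/hour slot."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # no history for this slot: do not alert (an assumption)
    avg, spread = baseline[slot]
    # Guard against a zero spread when all historical values are identical.
    return abs(value - avg) > tolerance * max(spread, 1e-9)


# Hypothetical requests-per-minute history for a food delivery app:
# four busy Fridays at 20:00 and four quiet Tuesdays at 20:00.
history = [
    (datetime(2018, 5, 4, 20), 950), (datetime(2018, 5, 11, 20), 1000),
    (datetime(2018, 5, 18, 20), 1050), (datetime(2018, 5, 25, 20), 1000),
    (datetime(2018, 5, 1, 20), 190), (datetime(2018, 5, 8, 20), 200),
    (datetime(2018, 5, 15, 20), 210), (datetime(2018, 5, 22, 20), 200),
]
baseline = build_baseline(history)

# The same load is normal on a Friday night but unusual on a Tuesday night.
friday_peak = is_anomalous(baseline, datetime(2018, 6, 1, 20), 1020)
tuesday_spike = is_anomalous(baseline, datetime(2018, 6, 5, 20), 1020)
print(friday_peak, tuesday_spike)
```

The same reading of 1,020 requests per minute raises no alert on a Friday evening but is flagged on a Tuesday evening, which is the behaviour a single fixed threshold cannot provide.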

Beyond the IT incident: don’t just be a fixer

IT has always been a problem-solver. Whilst spotting and fixing performance issues in production is critical to the success of the company, with the depth of insight IT can now access, it has the potential to do so much more.

Enterprise IT teams today have a far larger mandate – innovation. They are tasked every day with creating customer experiences that drive the business forward.

By spending less time troubleshooting, IT can spend more of their valuable time on the next new development targeted for production. Using their insight into the customer experience and understanding of the existing problems, IT can become a transformative partner in defining what the future of the business looks like.

Innovation is a question of creative problem solving. When James Dyson found a way to eliminate the bag in a vacuum cleaner, he didn’t start off by thinking about bags – he started with the problem.

He reframed the problem from creating ‘a better bag’ to ‘how to better separate dirt from air’. By examining the data, and redefining the problem of user experience, IT can find new creative ways to solve the enterprise challenges of today.

Getting ahead of the IT incident

As enterprises move towards more service-based application architectures, as more connected things feed in data, as more services move to the cloud and as systems automate more decisions, identifying issues and solving problems will become far more complex.

Soon, humans simply won’t be able to manage the massive volume of data and the countless combinations of possible points of failure. The ability to see application performance, visualise your architecture and dependencies as you scale, and identify business-impacting problems in real time is the only way to achieve faster resolution.

Not only this, it’s the only way to find the unsolved problems that will become your future business.