Einstein is often credited with saying that if he had one hour to save the world, he’d spend fifty-five minutes defining the problem and just five minutes finding the solution. For those who have sat in IT incident war rooms where Mean Time to Resolution rapidly becomes Mean Time to Innocence, fifty-five minutes might seem optimistic.
To the user, applications are simplifying. With a spoken command, our digital assistant or smart home device can play music, summon information on our daily schedule or provide us with ticket costs and weather conditions at a given travel destination.
But this simplicity for the user comes at a massive cost in application complexity. Today’s application architecture is highly dispersed, massively interdependent and needs to operate and update at breakneck speed. All this means that finding a problem, and its cause, has become immensely challenging.
Every millisecond counts in IT incident management
The minutes lost in identifying a problem are more critical to business than ever before. When it comes to loading time, every millisecond counts.
Poor app performance can manifest itself in a number of other ways too. Errors such as being redirected to blank web pages, website outages and crashing apps are just a few of the main offenders that rile people.
Companies need better ways to identify problems or pre-empt downtime before they lose business or see loyalty erode. In this context, the old method of round-robin finger-pointing, with each team bringing disparate monitoring data to the table, must change.
In today’s complex app maintenance environment, a business-impacting problem might have one or multiple points of failure. The ability to resolve it might sit with siloed internal teams, service centres in different time zones or third-party cloud operators.
Finding these points of failure is the first step, but it will only become more challenging as the pace of change accelerates. So how can IT organisations not just cope, but thrive, in this complex environment?
Cross-team collaboration is vital to IT incident management
All of the businesses I work with are continually evolving their production environments or expanding their digital services – whether it’s through in-house innovation or acquisition.
For many, that means increased complexity and increased demands on IT. Often teams have multiple monitoring tools, either custom builds or simple analytic tools built into point systems. These siloed systems can’t offer a complete view and often show the effect of poor performance rather than the root cause.
When a problem occurs, these systems simply don’t offer the ability to understand and act on performance issues as they happen. If IT is to spot and deal with incidents before they impact customers, it has to be able to see what’s happening across the entire stack, in real time.
Not only that, but performance monitoring must have the granularity and permissions to allow individual experts to drill down. To spot the root cause, this data must go all the way down to each individual user’s actions and the demands those actions make on the system.
IT must work as one. Whether it’s development and operations or database and network, teams cannot afford to problem-solve in silos. IT needs a performance monitoring system that provides the granularity needed by all teams, but brings those insights together in one place.
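To make the difference between effect and root cause concrete, here is a minimal sketch of what a shared view of the stack allows. It assumes a hand-written dependency map and a set of services currently reporting poor health; the service names and the `candidate_root_causes` helper are hypothetical, not taken from any particular monitoring product.

```python
def candidate_root_causes(dependencies, unhealthy):
    """Return unhealthy services whose own dependencies are all healthy.

    An unhealthy service that depends on another unhealthy service is
    more likely showing the effect of a problem than causing it.
    """
    return sorted(
        service
        for service in unhealthy
        if not any(dep in unhealthy for dep in dependencies.get(service, ()))
    )


# Hypothetical architecture: each service maps to the services it calls.
dependencies = {
    "web-frontend": ["checkout-api", "search-api"],
    "checkout-api": ["payments-db", "inventory-api"],
    "search-api": ["search-index"],
    "inventory-api": ["inventory-db"],
}

# Hypothetical alerts raised by three different teams' tools.
unhealthy = {"web-frontend", "checkout-api", "payments-db"}

print(candidate_root_causes(dependencies, unhealthy))  # ['payments-db']
```

Seen separately, the three alerts invite three teams to defend their own systems. Seen together against the dependency map, they point to a single place to start looking.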
IT incident management: establish what constitutes a ‘problem’
Consider what the business cares about. In the energy industry the consequences of failure are, at the very least, massive inconvenience and expense, and at worst, potentially dangerous. In the ecommerce industry, slow load times could massively impact revenues across a peak sales weekend. For the banking industry, an outage at midnight on payday could be catastrophic for millions.
For every business there are different metrics, different times and different performance issues that constitute a code red IT incident. Understanding exactly what these are, when these are, and what they look like, is critical to resolution.
But the real game-changer when establishing these problems and communicating them back to the business is the ability for IT to talk in terms of business performance and transactions, not CPU spikes and clock speeds.
Having a system that can understand and correlate information around your users’ journeys and individual business metrics makes this far easier. Data can then reveal the real, business-impacting problems and enable IT to set a hierarchy of response and resolution.
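As a rough illustration of a hierarchy of response, the sketch below ranks open incidents by estimated business impact rather than by technical severity. The `Incident` fields, the scoring formula (failed transactions per minute multiplied by an assumed value per transaction) and all the figures are hypothetical; a real organisation would substitute its own business metrics.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    name: str
    journey: str              # user journey affected, e.g. "checkout"
    failed_tx_per_min: float  # transactions currently failing or timing out
    value_per_tx: float       # assumed average business value of one transaction

    @property
    def impact_per_min(self) -> float:
        """Estimated business value lost per minute while the incident lasts."""
        return self.failed_tx_per_min * self.value_per_tx


def prioritise(incidents):
    """Order incidents from highest to lowest estimated business impact."""
    return sorted(incidents, key=lambda i: i.impact_per_min, reverse=True)


# Illustrative figures only.
incidents = [
    Incident("CPU spike on reporting server", "monthly report", 2, 0.5),
    Incident("Slow payment confirmation", "checkout", 40, 35.0),
    Incident("Intermittent login errors", "sign-in", 15, 10.0),
]

for incident in prioritise(incidents):
    print(f"{incident.impact_per_min:8.1f}  {incident.journey:15}  {incident.name}")
```

In this made-up example the CPU spike, which might look the most alarming on an infrastructure dashboard, lands at the bottom of the list, while the checkout slowdown goes straight to the top.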
Baseline performance: understand your rhythm
Understanding what your baseline of performance looks like is critical to defining what performance is problematic. Setting thresholds of acceptable performance will ensure that teams can understand any anomalies at a glance and can be alerted to business-critical issues in real time.
However, we all know that load can change dramatically minute-by-minute or day-by-day. A food delivery application may see more traffic on a Friday night than at any other point during the week.
A payroll application may experience higher load at the beginning and end of the month, compared to the rest of the month. Your performance monitoring system needs to know that this load doesn’t mean the sky is falling – or else you’ll be in for a few unnecessarily sleepless nights.
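A minimal sketch of a rhythm-aware baseline follows. It assumes load is sampled as requests per minute and that grouping history by weekday and hour is a good enough model of the rhythm; the function names, the three-standard-deviation tolerance and the sample data are all illustrative assumptions, not a description of any specific monitoring tool.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev


def build_baseline(samples):
    """Summarise historical (timestamp, value) samples per (weekday, hour) slot."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    return {slot: (mean(vals), pstdev(vals)) for slot, vals in buckets.items()}


def is_anomalous(baseline, ts, value, tolerance=3.0, min_spread=1.0):
    """Flag a value that strays too far from what is normal for its slot."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # assumption: with no history for a slot, stay quiet
    avg, spread = baseline[slot]
    return abs(value - avg) > tolerance * max(spread, min_spread)


# Illustrative history for a food delivery app, in requests per minute at 20:00.
history = [
    (datetime(2024, 1, 5, 20), 980),   # Fridays
    (datetime(2024, 1, 12, 20), 1010),
    (datetime(2024, 1, 19, 20), 1000),
    (datetime(2024, 1, 26, 20), 1030),
    (datetime(2024, 1, 2, 20), 190),   # Tuesdays
    (datetime(2024, 1, 9, 20), 210),
    (datetime(2024, 1, 16, 20), 200),
    (datetime(2024, 1, 23, 20), 205),
]
baseline = build_baseline(history)

print(is_anomalous(baseline, datetime(2024, 2, 2, 20), 1020))  # Friday: False
print(is_anomalous(baseline, datetime(2024, 2, 6, 20), 1020))  # Tuesday: True
```

The same reading of 1,020 requests per minute is routine on a Friday evening and worth a page on a Tuesday. A single static threshold could not tell the two apart.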
Beyond the IT incident: don’t just be a fixer
IT has always been a problem-solver. Whilst spotting and fixing performance issues in production is critical to the success of the company, with the depth of insight IT can now access, it has the potential to do so much more.
Enterprise IT teams today have a far larger mandate – innovation. They are tasked every day with creating customer experiences that drive the business forward.
By spending less time troubleshooting, IT can spend more of their valuable time on the next new development targeted for production. Using their insight into the customer experience and understanding of the existing problems, IT can become a transformative partner in defining what the future of the business looks like.
Innovation is a question of creative problem solving. When James Dyson found a way to eliminate the bag in a vacuum cleaner, he didn’t start off by thinking about bags – he started with the problem.
He reframed the problem from ‘how to create a better bag’ to ‘how to better separate dirt from air’. By examining the data and redefining the problem of user experience, IT can find new, creative ways to solve the enterprise challenges of today.
Getting ahead of the IT incident
As enterprises move towards more service-based application architectures, as more connected things feed in data, as more services move to the cloud and as systems automate more decisions, identifying issues and solving problems will become far more complex.
Soon, humans alone simply won’t be able to manage the massive volume of data and the near-endless combinations of possible points of failure. The ability to see application performance, visualise your architecture and dependencies as you scale, and identify business-impacting problems in real time is the only route to faster resolution.
Not only that, it’s the only way to find the unsolved problems that will become your future business.