The NTT Docomo October 2021 network outage illustrates the danger of network signalling storms. Because they can cause lengthy outages, signalling storms are among the most troublesome threats to resiliency as networks become more sophisticated and more mission-critical.
While early-stage implementations of 5G have focused on increased speed and decreased latency, the mature version of 5G is designed to support the industry’s overall drive toward higher reliability for a wide variety of users, device types, and services. From the beginning, 5G was designed to support mission-critical, real-time services with improved response, but also via the ability to dedicate capacity and define performance for a specific use via network slicing. 4G and fixed-line networks are also moving in this direction: enabling new and higher-value services is critical to monetizing carriers’ extensive network investments.
Each network slice, service, and contracted performance level increases the amount of signalling traffic, both within the network and between the network and the devices it supports. As these elements proliferate, the traffic increase can be dramatic. But while this increased activity is essential to new network use cases, excess communication can quickly overload networks, creating a “signalling storm” that prevents the network from coordinating its traffic.
Docomo outage affected 2 million users
Whether a network is brought down by a flood or a faulty upgrade, restoring normal function quickly after an outage is critical. In an October 2021 incident that is probably the most widely covered case of signalling storm-related problems, NTT Docomo Japan experienced a twelve-hour outage for two million users (and up to 29 hours for its 3G connections) after rolling back a migration to a new subscriber registry.
After discovering that some IoT connections were failing, Docomo had to force re-registration of subscribers. The volume of requests overloaded the network, and because each failed registration triggered additional requests, the network could not clear the backlog until the following morning.
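This amplification dynamic, where every failure immediately generates a fresh request, is what turns a backlog into a storm. The toy simulation below (a hypothetical sketch, not Docomo's actual architecture; the function name, device counts, and capacity figures are all illustrative) contrasts immediate retries with jittered exponential backoff: both eventually drain the same backlog, but immediate retries multiply the total request volume hitting the registry.

```python
import random

def simulate_reregistration(devices, capacity, use_backoff,
                            max_steps=10_000, seed=42):
    """Toy model of a registration backlog: `devices` clients retry until
    a registry that accepts `capacity` requests per step has registered
    them all.  Returns (steps_to_drain, total_requests_sent)."""
    rng = random.Random(seed)
    next_try = [0] * devices   # step at which each device next attempts
    failures = [0] * devices   # per-device failure count (drives backoff)
    done = [False] * devices
    total_requests = 0
    for step in range(max_steps):
        ready = [i for i in range(devices)
                 if not done[i] and next_try[i] <= step]
        total_requests += len(ready)
        # The registry serves the first `capacity` requests; the rest fail.
        for i in ready[:capacity]:
            done[i] = True
        for i in ready[capacity:]:
            failures[i] += 1
            if use_backoff:
                # Exponential backoff with full jitter, capped at 2^6 steps.
                next_try[i] = step + 1 + int(rng.uniform(0, 2 ** min(failures[i], 6)))
            else:
                next_try[i] = step + 1  # immediate retry: the storm case
        if all(done):
            return step + 1, total_requests
    return max_steps, total_requests
```

Backoff trades a longer drain time for a far lower offered load; and since low-intelligence terminals cannot be trusted to back off themselves, any such spreading of retries has to be imposed by the network.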
Device re-registration has been the weak link in several other signalling storms, and as low-power, low-intelligence connected devices proliferate, operators will increasingly have to control signalling traffic entirely on the network side rather than relying on terminals for help. But since storms can arise anywhere in the signal chain – especially as edge services multiply – operators must design for signalling spikes across the entire network.
Signalling traffic control
To some extent, virtualization and especially cloudification will help avoid bottlenecks, since additional microservice-based instances can be spun up to handle unusually high volume. But that increased activity carries its own overhead and can be caught in failure loops of its own. At least until operators have 100% cloud-native networks, they will have to implement separate intelligence to limit signalling traffic.
While this could take the form of simple throttling, a more productive approach is to prioritize signalling based on contracts, customer value, link load, CPU load, and other technical and business factors. This intelligence will allow the operator to restore the most important traffic first while avoiding overload. Doing so will also require advanced analytics and the unification of multiple data sources; fortunately, many operators are already carrying out this work.
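One way to picture such prioritization is as an admission decision made per signalling request. The sketch below is purely illustrative (the `SignallingRequest` type, the tier/value fields, and the headroom formula are assumptions, not any vendor's API): it ranks queued requests by contracted slice tier and customer value, then shrinks the admission budget as CPU or link utilization rises, deferring rather than dropping the remainder.

```python
from dataclasses import dataclass

@dataclass
class SignallingRequest:
    slice_tier: int        # contracted priority: 0 = mission-critical ... 3 = best-effort
    customer_value: float  # normalized business weight in [0, 1]

def admit(requests, capacity, cpu_load, link_load):
    """Rank queued signalling requests and admit only what current
    headroom allows; the rest are deferred for a later cycle."""
    # Shrink the admission window as CPU/link utilization rises.
    headroom = max(0.0, 1.0 - max(cpu_load, link_load))
    budget = int(capacity * headroom)
    # Lowest tier number first; within a tier, highest customer value first.
    ranked = sorted(requests, key=lambda r: (r.slice_tier, -r.customer_value))
    return ranked[:budget], ranked[budget:]
```

During recovery from an outage, an admission rule like this is what lets the operator bring mission-critical slices back first instead of letting all pending signalling contend at once.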
Inevitably, something will trigger a network outage, whether large or small. Operators should therefore design their networks to avoid kicking themselves once they are down; signalling should not create a “storm after the storm.”