Chief Architect, Quantiv
The UK air traffic control failure caused chaos on the busy August 2023 bank holiday. Thankfully, all flights could land safely. But the incident caused hundreds of flights to be cancelled or delayed.
Since then, NATS (formerly known as National Air Traffic Services) has released a preliminary report into the issue.
Reading the document, I was struck by the simplicity of the issue. There was no malicious cybercrime, just a perfectly understandable coding error.
And while the disruption continued for days, the problem was identified comparatively quickly.
The duplicate waypoints issue
The report itself has a human feel about it. It’s clear and honest, and as free of jargon as could be expected given the nature of the subject. It still has enough technical detail to be useful and, considering its purpose, it carries no feeling of blame. Overall, it’s a dispassionate description of what happened, the reasons behind it, and the changes that might be needed. Arguably, it’s an example of management and regulation at its best.
The cause itself was a flight plan that included two identically named, but separate, waypoint markers outside of UK airspace. This led to both the primary system and its backup entering a fail-safe mode.
The idea of navigating by named waypoints seems to hark back to an earlier, simpler (pre-satnav) time. It conjures up images of travelling along familiar routes between recognisable landmarks (houses, inns, natural features, etc.).
However, as someone who works in IT, I felt an inevitability to the way the problem developed. At the first mention of ‘duplicate waypoints’, I let out a groan. From there, the conclusion appeared all too predictable. It was like a technology version of the long-running BBC medical drama series, Casualty, where the viewer can usually see the accident coming long before it happens.
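For readers who want to see why duplicate names make an IT person groan, here is a minimal sketch of the general pattern. It is not the NATS code and doesn’t reproduce the detail of the report. The waypoint names, coordinates, plan identifiers and function names are all hypothetical, chosen only for illustration. The sketch assumes a lookup keyed on waypoint name, and shows one way of making an ambiguous name an explicit, per-plan error rather than a surprise.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Waypoint:
    name: str
    lat: float
    lon: float


class AmbiguousWaypointError(Exception):
    """Raised when a name matches more than one waypoint."""


def build_index(waypoints):
    # Keep every waypoint for a name; a plain dict keyed on name would
    # silently overwrite one of the duplicates.
    index = defaultdict(list)
    for wp in waypoints:
        index[wp.name].append(wp)
    return index


def resolve(index, name):
    matches = index.get(name, [])
    if not matches:
        raise KeyError(f"unknown waypoint {name!r}")
    if len(matches) > 1:
        raise AmbiguousWaypointError(
            f"{name!r} matches {len(matches)} separate waypoints"
        )
    return matches[0]


def process_plans(index, plans):
    # Set aside an individual plan that cannot be resolved, with a reason,
    # and carry on with the rest.
    accepted, rejected = [], []
    for plan_id, route in plans.items():
        try:
            accepted.append((plan_id, [resolve(index, n) for n in route]))
        except (KeyError, AmbiguousWaypointError) as exc:
            rejected.append((plan_id, str(exc)))
    return accepted, rejected


# Hypothetical data: two separate waypoints share the name "AAA".
waypoints = [
    Waypoint("AAA", 10.0, 20.0),
    Waypoint("AAA", -30.0, 40.0),
    Waypoint("BBB", 50.0, 0.0),
    Waypoint("CCC", 52.0, 1.0),
]
plans = {"P1": ["BBB", "CCC"], "P2": ["BBB", "AAA"]}

index = build_index(waypoints)
accepted, rejected = process_plans(index, plans)
```

Whether to set aside a single plan, as here, or to stop processing altogether is a design decision. In a safety-critical system, failing safe may well be the deliberate choice. Either way, the reason recorded alongside each rejected plan is the kind of information that makes a problem quicker to diagnose.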
The importance of good information
But while the air traffic control failure caused significant disruption for many, and the cause may have had a feel of dreaded familiarity, the response itself was heartening.
I know only too well the feeling of trying to solve a problem under pressure – although, I admit, perhaps not with the same expectations and intensity as the air traffic issue. I also know that sometimes the pressure of those situations can be made worse by actions that have been taken – or not taken – before the problem happens.
Here, while it might not have been in their minds at the time, those involved in solving the issue can draw some consolation from the work they’d done ahead of it. In large part, the speed – and more importantly, certainty – of the response was related to having good information available. Having a clear, concise and approachable picture of what was happening, and what had happened, was critical to allowing the problem to be resolved.
Inform real-time decision-making
But that good information isn’t just some Arthurian artefact, magically appearing just at the point when a system is in trouble. It needs planning to identify what constitutes good (useful) information. And it needs time to determine how to collect and distribute that information quickly. It can then be used to inform real-time decision-making, instead of ‘after-the-event’ diagnosis of what went wrong. To continue the hospital analogy: to make it part of the treatment, rather than the autopsy.
To define that information, you need a model for your organisation’s processing. And while the processing may be well known, it can often be hard to explain its model in such a way that it can be easily understood, monitored and used by others.
Helping articulate that model is the purpose of Quantiv’s NumberWorks method. It can define your operational processing so you can precisely identify the metrics that are most useful for your operations. And it can do this regularly or in exceptional circumstances.
This is also exactly the pattern of metrics collection supported by our NumberCloud product, which collects the data you already know you need. And if this data can be exposed through a simple API, you can use it for crisis decision-making. Plus, the data can be used in more normal operations to support expected processing, and therefore simplify and automate integration between applications.
What can we learn from the UK air traffic control failure?
So, what are the lessons to be learnt from the episode?
Avoiding duplicate names is, of course, a key takeaway.
But beyond that, the lesson is to make sure you always have the right operational information defined – and then to make that information available quickly so the right actions can be taken.
To find out more, contact our team on 0161 927 4000 or email: firstname.lastname@example.org