Rethinking Tooling for Cloud Native 5G Networks
November 25, 2021 - Anton Palagin
For many years, service assurance and network analytics vendors have looked after the reliability and efficiency of network operations, delivering thousands of NOC, SOC, SQM and CEM use cases. This helped operators troubleshoot sophisticated networking issues and find the root cause of massive outages, ensuring continuous communication services for their consumers. We can easily name many industry-recognized products for monitoring the performance and availability of distributed networks, as well as rich service assurance products. Vendors have invested a lot of resources in adapting their solutions to the challenges the telco industry is going to face with the 5G rollout. Considerable effort has gone into categorising traditional tools and processes under new buzzwords such as network automation and digital transformation. However, I fear that in a lot of cases this is really an exercise in selling old stuff wrapped in new gift paper.
In many recent discussions with partners and customers, I have had the feeling that people consider the transition from existing network architectures to 5G to be similar to the telecoms industry’s migration from 3G to LTE. The truth is, there are significant differences. If we want the benefits of public and hybrid clouds, such as scalability with significant cost reductions, we can’t achieve them using the existing practices, processes and tools which have served us well for many years. The tools and best practices which were developed to address the challenges of the old world may not work in this new environment, or may not work efficiently enough. For example, it’s not hard to understand why hyperscalers don’t deploy passive probing for raw network traffic, why they do not build expensive multi-dimensional, multi-metric and multi-server data lakes and why they do not monitor every single hop of every single network packet. As we look at the cloud native environment, we see that these past practices would be too expensive and would not help to manage complexity at hyperscale. So, what should we be using in their place for 5G cloud native networks?
One example of these new practices is SRE (Site Reliability Engineering), which provides a software engineering approach to IT operations. SRE teams use software as a tool to manage complex systems, solve sophisticated problems, and automate routine operations tasks. SRE takes the tasks that have traditionally been done by Ops teams, often manually, and instead gives them to engineers who use software and automation to manage the production systems and solve problems. I really enjoy how Google managed to define it: “SRE is what happens when you ask a software engineer to design an operations team.” This lesson is exactly what the communications industry needs to learn if we want to enjoy all the benefits delivered by cloud and cloud-native solutions. We need to stop thinking in terms of networking appliances, even if they are virtualized; stop hop- and packet-based micro-management; and stop considering a network function as a vendor-proprietary black box which is not compatible with anything except other black boxes from (surprise!) the same vendor. If we want to enjoy all the benefits of software-defined networks, we need to start thinking in the same way as software engineers do and become SREs rather than good old-fashioned firefighters, so that we can realise the benefits of our 5G cloud native networks.