Cloud & Virtualization

Can Cloud-Native CNF Testing Follow Traditional Approaches?

December 6, 2023 · 5 min read

Cloud-native adoption creates service reliability issues that impact customer experiences. A new test paradigm can prevent them from entering the production network.

More service providers are moving cloud-native architectures from the lab to production networks. In this new frontier, they are encountering performance issues like network function failures that bring services crashing down. These failures don’t always have straightforward or well-known resolutions. And because they’re not identified in pre-production testing, they impact the customer experience in production network scenarios.

In recent discussions with customers, Spirent has been asked how to identify and provide insights for proactively fixing these issues before they penetrate the production network. The answer lies in a new test paradigm for the pre-production lab: CNF resiliency testing.

What’s so different about cloud-native? In a word, dependency

Software-based, cloud-native environments are a dramatic change from traditional monolithic networks.

Microservices comprise multiple cloud-native network functions (CNFs) that use many pods to provide specific functions. The pods are deployed across numerous nodes running on clouds that dynamically handle different workloads. As a result, a service may depend on hundreds of connections and thousands of transactions, any one of which can take too long to respond or simply fail. This high degree of dependency raises the probability of failures and makes problem identification and resolution complex and time-consuming.
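To make the dependency argument concrete, here is a minimal sketch. It assumes every dependency fails independently with the same probability, which is a simplification because real failures are often correlated. The function name and the 0.1% per-transaction failure rate are illustrative assumptions, not measurements from the article.

```python
def service_failure_probability(per_dependency_failure: float, dependencies: int) -> float:
    """Probability that at least one of `dependencies` independent calls fails.

    Assumes identical, independent failure probabilities (an idealization).
    """
    if not 0.0 <= per_dependency_failure <= 1.0:
        raise ValueError("per_dependency_failure must be between 0 and 1")
    if dependencies < 0:
        raise ValueError("dependencies must be non-negative")
    # The service succeeds only if every single dependency succeeds.
    return 1.0 - (1.0 - per_dependency_failure) ** dependencies


# Illustrative values only: a 0.1% chance that any one transaction fails.
for n in (1, 10, 100, 1000):
    print(n, round(service_failure_probability(0.001, n), 4))
```

Under this simplified model, a service that needs 1,000 transactions to succeed fails roughly 63% of the time even though each individual transaction succeeds 99.9% of the time.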

Workloads have also become dynamic, with each networking vendor providing frequent updates on individual timetables.

As service providers deploy cloud-native network functions, this new level of complexity and inter-dependency is directly impacting the reliability of production networks.

The industry is just starting to realize that it must rethink test strategies so that CNF issues are identified and resolved in pre-production rather than in the production network, where the stakes are much higher: outages and service interruptions are costly and degrade customer Quality of Experience. After all, cloud-native networks will be able to scale only when they are comprehensively tested in pre-production.

Rethinking testing in a CNF world

Cloud-native characteristics make pre-production CNF testing essential, but the old way of testing single-vendor, integrated networks is not up to the challenge because:

It’s harder to simulate reality. The cloud used in the lab to test CNFs is more stable, well-understood, and well-behaved than the cloud or clouds the CNFs will utilize in the production network.
Performance is more vulnerable. The hundreds of pods and nodes a production CNF might utilize must communicate within millisecond latencies to avoid timeouts. One delayed link in the chain can cause failures that cascade rapidly and ultimately result in 5G service failures.
The unexpected will most likely occur. When SLAs depend on disaggregated, distributed microservices interacting in just the right way at just the right time, there is a high probability that one or more links between CNF pods will break or time out.

Let’s dive deeper into CNF testing.

Probability metrics underlie pre-production CNF resiliency testing

The dynamic nature of cloud-native means a given user activity may not lead to the same performance result or failure every time since it may take different paths on different infrastructures. Therefore, testing must focus on failure probabilities, causes, and the impact of failure on each scenario. Comprehensive resiliency testing must be performed under real-world scenarios with intentional fault insertions, not just ideal conditions.
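One way to picture probability-focused testing with fault insertion is a small Monte Carlo simulation. The sketch below is not Spirent's method. It models a request as a chain of hops, randomly injects an extra delay on each hop, and estimates how often the request exceeds a timeout. All parameter names and values are hypothetical.

```python
import random


def estimate_failure_probability(
    hops: int,
    base_latency_ms: float,
    jitter_ms: float,
    fault_probability: float,
    fault_delay_ms: float,
    timeout_ms: float,
    trials: int = 10_000,
    seed: int = 0,
) -> float:
    """Estimate the fraction of requests that exceed `timeout_ms`.

    Each hop costs `base_latency_ms` plus uniform jitter. With probability
    `fault_probability`, an injected fault adds `fault_delay_ms` to that hop.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    failures = 0
    for _ in range(trials):
        total_ms = 0.0
        for _ in range(hops):
            total_ms += base_latency_ms + rng.uniform(0.0, jitter_ms)
            if rng.random() < fault_probability:
                total_ms += fault_delay_ms  # intentional fault insertion
        if total_ms > timeout_ms:
            failures += 1
    return failures / trials


# Hypothetical scenario: 20 hops, 50 ms end-to-end budget.
ideal = estimate_failure_probability(20, 1.0, 0.5, 0.0, 25.0, 50.0)
faulty = estimate_failure_probability(20, 1.0, 0.5, 0.05, 25.0, 50.0)
print(f"ideal conditions: {ideal:.3f}, with injected faults: {faulty:.3f}")
```

The point of the comparison is that a scenario which never fails under ideal lab conditions can show a measurable failure probability once faults are inserted.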

CNFs implicitly expect certain packet loss, latency, and CPU and storage response times—and the cloud normally provides them. But because the expectations are implicit, the cloud does not take actions to ensure the specific CNF’s needs are met. Service providers are already seeing on average up to a dozen CNF production failures per quarter that they didn’t predict during pre-production testing.
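A simple way to think about turning implicit expectations into explicit ones is to write them down as data and check observed cloud conditions against them. The sketch below is purely illustrative: the class, field names, and thresholds are assumptions for the example and do not come from any CNF vendor or standard.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class CloudExpectations:
    """Hypothetical explicit statement of what one CNF needs from the cloud."""
    max_packet_loss_pct: float
    max_latency_ms: float
    max_cpu_wait_ms: float
    max_storage_response_ms: float


def violated_expectations(expected: CloudExpectations,
                          observed: Dict[str, float]) -> List[str]:
    """Return the names of observed metrics that exceed the CNF's limits.

    Metrics missing from `observed` are skipped rather than flagged.
    """
    limits = {
        "packet_loss_pct": expected.max_packet_loss_pct,
        "latency_ms": expected.max_latency_ms,
        "cpu_wait_ms": expected.max_cpu_wait_ms,
        "storage_response_ms": expected.max_storage_response_ms,
    }
    return [name for name, limit in limits.items()
            if name in observed and observed[name] > limit]


# Illustrative thresholds and observations only.
needs = CloudExpectations(0.1, 5.0, 2.0, 10.0)
seen = {"packet_loss_pct": 0.5, "latency_ms": 3.0,
        "cpu_wait_ms": 1.0, "storage_response_ms": 12.0}
print(violated_expectations(needs, seen))
```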

Until now, there hasn’t been a clear understanding of the points where performance degrades enough to impact service. Those statistical breaking points need to be identified for each CNF, for packet loss, latency, CPU, and storage.

So, for example, if you measure cloud fabric packet loss at a particular location as a function of active 5G sessions, you will see the point at which performance degrades (where the blue line in the figure starts to fall). Depending on the CNF, this performance drop may or may not be tolerated.

Pre-production measurements such as these determine Key Failure Indicators (KFIs) for each CNF and provide important performance insights in the lab.
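The breaking-point idea described above can be sketched in a few lines. This is a simplified illustration, not the article's KFI methodology. It takes measurements of a "higher is better" metric at increasing load and returns the first load level at which the metric falls more than a chosen tolerance below its low-load baseline. The sample data and the tolerance are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def find_breaking_point(samples: Sequence[Tuple[int, float]],
                        tolerance: float = 0.05) -> Optional[int]:
    """Return the lowest load where the metric drops below the tolerated floor.

    `samples` holds (load, metric) pairs, where a higher metric is better.
    The baseline is the metric at the lowest measured load. Returns None if
    the metric never falls more than `tolerance` (a fraction) below baseline.
    """
    if not samples:
        return None
    ordered = sorted(samples)  # sort by load
    baseline = ordered[0][1]
    floor = baseline * (1.0 - tolerance)
    for load, value in ordered:
        if value < floor:
            return load
    return None


# Hypothetical data: (active 5G sessions, packet delivery ratio in percent).
measurements = [(1000, 99.99), (5000, 99.98), (10000, 99.97),
                (20000, 98.5), (40000, 91.0)]
print(find_breaking_point(measurements, tolerance=0.01))  # prints 20000
```

Whether a drop at that load matters depends on the CNF, so in practice the tolerance would be chosen per CNF and per metric.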

This CNF resiliency testing data is incredibly powerful, and it cannot be obtained with today’s lab test methods.

CNF resiliency testing provides value in production networks, too

Pre-production measurements provide important performance insights in the lab—and also for the production network. By providing operations teams with the root cause probabilities of outages and their impact on 5G services, these measurements become essential factors for production network monitoring and troubleshooting. They enable resolution prioritization based on subscriber impact, as well as rapid troubleshooting and remediation.

As an example, the table below illustrates the key failure indicator metrics for cloud infrastructure packet loss compared to no packet loss, for registration, connect time, and HTTP traffic network functions. The metrics circled in red show where cloud packet loss significantly degrades performance. Such data enable efficient monitoring and faster failure resolution in the production network.

The data also enable rapid root cause analysis when issues arise in production, helping operations to quickly identify and focus on the problematic area instead of doing painful and time-consuming troubleshooting on a conference call with 25 people from operations and various vendor teams.

By measuring performance for each product release, operators can quickly identify the specific release that’s degrading performance and provide the relevant vendor with precise data to facilitate rapid resolution.
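As a rough sketch of per-release comparison, the function below walks through measurements in release order and reports the first release whose metric worsens by more than a chosen fraction relative to the previous release. The release names, metric values, and 10% threshold are hypothetical, and a "lower is better" metric such as registration time is assumed.

```python
from typing import List, Optional, Tuple


def first_regressing_release(results: List[Tuple[str, float]],
                             max_increase: float = 0.10) -> Optional[str]:
    """Return the first release whose metric rises by more than `max_increase`.

    `results` holds (release, metric) pairs in release order, where a lower
    metric is better. The increase is measured relative to the previous
    release. Returns None when no release regresses beyond the threshold.
    """
    for (_, previous), (release, current) in zip(results, results[1:]):
        if previous > 0 and (current - previous) / previous > max_increase:
            return release
    return None


# Hypothetical registration times in milliseconds for four releases.
history = [("r1", 100.0), ("r2", 102.0), ("r3", 130.0), ("r4", 131.0)]
print(first_regressing_release(history))  # prints r3
```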

The benefits of CNF resiliency testing

Resiliency testing of cloud-native environments may be more complex than traditional lab testing, but it is worth the journey. By understanding exactly what each CNF needs from the cloud and how each CNF is vulnerable, many problems can be avoided before they become an issue in production. As a result, new high-quality 5G services can be moved quickly into production and remain reliable even in challenging cloud conditions. Production network issues will be reduced and when they do happen, they can be resolved quickly.

CNF resiliency testing makes business sense as well: it helps providers harness cloud-native efficiencies and gain the agility to reduce operating costs. Infrastructure investments can be targeted at the components that will drive the biggest improvements to server performance and efficiency. And more stringent and lucrative SLAs can be offered and delivered.

At Spirent, we help communications service providers understand the impact of cloud-native on pre-production testing and introduce CNF resiliency testing. We’ve deployed our own cloud-native 5G core based on open source and have demonstrated the value of CNF resiliency testing.

Learn more about CNF resiliency testing in our article "In a Cloud-Native World, Resiliency Equals Confidence" in Everything RF, or delve into the details in our 5G CNF Resiliency Test Guide.


Tags: Cloud & Virtualization

Bill Clark

Principal Product Manager, Automated Test and Assurance

Bill Clark is a Principal Product Manager in Spirent’s Automated Test and Assurance business unit, where he focuses on 5G cloud-native validation solutions that address the industry shift to containerized, microservices network functions in 5G deployments. Before joining Spirent, Bill worked in product management at large corporations, mid-size companies, and start-ups, with a blend of in-depth technical knowledge, strong business acumen, and product marketing.