Service Assurance of Today
NEWS
Today, assurance solutions operate mostly in an open-loop fashion. That is, assurance is a reactive, non-feedback process that provides network visibility, helps predict faults, and feeds data to Operations Support System (OSS)/Business Support System (BSS) solutions and operations personnel. With 3G and 4G, service assurance is a standalone solution that is laid on static, centrally engineered networks and siloed inventories. In fact, it is reported that up to 97% of assurance solutions do not support cross-platform automation. Software design for today’s assurance solutions is anchored on coarse-grained software elements, meaning the software is typically housed in a few large network objects. In today’s networks, network services are pegged to physical appliances, and there is a 1:1 mapping between service and resource. Therefore, the link between a network resource and the service running on top of it is visible, stable, and predictable. This simplifies the role of monitoring, logging, and tracing, which are the three key assurance requirements for automation and Lifecycle Management (LCM) of network resources.
Today’s assurance scope includes performance monitoring, configuration and fault management, Service-Level Management (SLM), policy management, and Quality of Service (QoS), among others. That scope in 5G networks remains largely the same, and so does assurance’s role from a functional perspective. However, the ecosystem and complexity that assurance will need to contend with change drastically. For example, it is reported that there are more than 400 network procedures in 5G networks, each of which has its own dedicated Key Performance Indicators (KPIs) and processing algorithms. According to Huawei, there will be a 10X increase in Call Data Records (CDRs) from 4G to 5G. That complexity can increase operations costs by 100% to 130% (see ABI Insight “Decoding the Economics of Telecom Big Data and Analytics”). It is vital that Communication Service Providers (CSPs) grasp the cloud’s impact on service assurance. That is the first step toward understanding how next-generation (cloud and software-based) network elements will interact with the existing ecosystem, so that they can be used with reliable effect.
The Cloud's Impact on Service Assurance
IMPACT
First, the cloud introduces new technologies like telemetry and new observability frameworks (e.g., Grafana, Istio). For example, cloud providers use their own data flows and proprietary Application Programming Interfaces (APIs) for monitoring and LCM. Cloud workloads can be Network Functions (NFs) or Information Technology (IT) apps and are ephemeral in composition. Cloud software runs on a continuous delivery model supporting fast cycles of build, test, deploy, and release. Cloud components can self-register, spin up on demand, and decommission without any manual involvement. These workloads also communicate with each other, which creates in-stack, “east-west” traffic. In that setting, traditional (stationary) tapping tools are no longer viable. By the time an alarm is raised, impacted resources may already have been rebuilt by a resource controller. The challenge for assurance is to automate configurations for fleeting, potentially one-time use deployments. Increasingly, CSPs will need to monitor (cloud software) instances and provide a consistent operational experience, ensuring that KPIs are met.
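The point about alarms outliving the resources they refer to can be illustrated with a minimal sketch. It assumes a hypothetical inventory that maps each workload name to the identifier of its currently running instance; the function name, field names, and instance identifiers are illustrative assumptions, not part of any vendor's product.

```python
def triage_alarm(alarm, live_instances):
    """Classify an alarm against the current (hypothetical) instance inventory.

    alarm: dict with "workload" and "instance_id" keys (assumed schema).
    live_instances: dict mapping workload name -> current instance id.
    """
    current = live_instances.get(alarm["workload"])
    if current is None:
        # The workload has been decommissioned entirely.
        return "workload-gone"
    if current != alarm["instance_id"]:
        # The controller already rebuilt the instance the alarm refers to.
        return "stale"
    return "actionable"


# Illustrative inventory: the "upf" workload was rebuilt after the alarm fired.
live = {"upf": "upf-7c9d", "smf": "smf-12ab"}
print(triage_alarm({"workload": "upf", "instance_id": "upf-41aa"}, live))  # stale
print(triage_alarm({"workload": "smf", "instance_id": "smf-12ab"}, live))  # actionable
```

A real deployment would query the orchestrator or service registry rather than a static dictionary, but the check itself stays the same: compare what the alarm names with what is running now.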
Second, clouds expand the required “assurance scope.” In addition to the physical layer, the hypervisor layer springs into existence. Hypervisors enable a pool of resources to be virtualized or abstracted from the actual physical network layer. Next, there is a potential Platform-as-a-Service (PaaS) layer for Container Network Function (CNF) hosting. And there is the app layer. Each layer is an individual ecosystem with its own objects, virtualized resources, and LCM states. Hypervisor and/or PaaS layers introduce an intermediary abstraction that disconnects resources from services. In other words, cloud software heightens north-south, vertical connectivity. That leads to much larger event activity, and it marks a significant operational change because the app overlay and cloud underlay traditionally have been assured separately. The vertical expansion introduced by the cloud increases the need for vendors to make it easier for CSPs to assure networks. An integrated view for overlay and underlay correlation and aggregation of events and alarms is a first step in that direction, and a key pillar for cloud-native automation going forward.
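As a rough illustration of overlay and underlay correlation, the sketch below groups app-layer (overlay) events under the infrastructure (underlay) event that occurred on the same host within a short time window. The event schema, the pod-to-host placement map, and the 30-second window are all assumptions made for the example, not a description of any vendor's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    layer: str        # "underlay" or "overlay" (assumed labels)
    resource: str     # host name for underlay, pod name for overlay
    timestamp: float  # seconds
    message: str


def correlate(events, placement, window_s=30.0):
    """Attach overlay events to underlay events on the hosting resource.

    placement: dict mapping overlay resource (pod) -> underlay resource (host).
    Returns a dict: underlay Event -> list of overlay Events within window_s.
    """
    underlay = [e for e in events if e.layer == "underlay"]
    overlay = [e for e in events if e.layer == "overlay"]
    groups = defaultdict(list)
    for u in underlay:
        for o in overlay:
            same_host = placement.get(o.resource) == u.resource
            close_in_time = abs(o.timestamp - u.timestamp) <= window_s
            if same_host and close_in_time:
                groups[u].append(o)
    return dict(groups)


# Hypothetical placement and events for illustration only.
placement = {"upf-pod-1": "host-a", "smf-pod-3": "host-b"}
events = [
    Event("underlay", "host-a", 100.0, "NIC link flap"),
    Event("overlay", "upf-pod-1", 112.0, "packet loss above threshold"),
    Event("overlay", "smf-pod-3", 115.0, "session setup latency high"),
    Event("overlay", "upf-pod-1", 400.0, "pod restarted"),
]
for cause, symptoms in correlate(events, placement).items():
    print(cause.message, "->", [s.message for s in symptoms])
```

Only the first overlay event is grouped under the link flap: the second runs on a different host and the third falls outside the window. The nested loop is adequate for a sketch; at the event volumes described above, an index by host and time bucket would be needed.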
Third, cloud software deconstructs networks into small, lightweight, and loosely coupled components (see ABI Insight “Virtual Machines Are the Standard in Telco Cloud, But Containers Are the Future”). A 1:1 relation between logical and physical cores is no longer in place. Physical resources can be turned into abstractions that can be mapped to any service regardless of locality. Log events are generated in different locales and on different time scales, all originating from a fluid and decentralized network topology (see ABI Insight “Decoding Three Key Technological Trends in the Telecoms Industry”). This makes it challenging to handle policies and assurance processing rules (aggregation, correlation, etc.) at scale. A second challenge is dynamic Service-Level Objectives (SLOs), an indispensable part of advanced services, such as 5G slices, mobile Internet of Things (IoT), and Low Earth Orbit (LEO) satellite connectivity. These services entail extensive and sometimes global coverage. CSPs will need to specify different SLOs for the different markets and geographies in which they operate. As a result, service assurance should take the mobility aspect into account and dynamically apply thresholds to properly calculate KPI violations.
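The idea of applying thresholds dynamically can be sketched as a lookup of the SLO that applies to the region where a measurement was taken. The region names, threshold values, and sample fields below are invented for illustration; real SLOs would come from the CSP's service catalog or policy system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    max_latency_ms: float
    min_throughput_mbps: float


# Hypothetical per-region SLOs; the values are illustrative assumptions.
SLOS_BY_REGION = {
    "eu-west": Slo(max_latency_ms=20.0, min_throughput_mbps=100.0),
    "apac-south": Slo(max_latency_ms=50.0, min_throughput_mbps=40.0),
}
DEFAULT_SLO = Slo(max_latency_ms=100.0, min_throughput_mbps=10.0)


def kpi_violations(sample):
    """Return the KPIs a sample violates under the SLO of its current region."""
    slo = SLOS_BY_REGION.get(sample["region"], DEFAULT_SLO)
    violated = []
    if sample["latency_ms"] > slo.max_latency_ms:
        violated.append("latency")
    if sample["throughput_mbps"] < slo.min_throughput_mbps:
        violated.append("throughput")
    return violated


# The same measurement is a violation in one region and compliant in another.
reading = {"latency_ms": 30.0, "throughput_mbps": 60.0}
print(kpi_violations({**reading, "region": "eu-west"}))     # ['latency', 'throughput']
print(kpi_violations({**reading, "region": "apac-south"}))  # []
```

Because the region is read from each sample rather than fixed per subscriber, a device that moves between geographies is evaluated against the thresholds of wherever it currently is.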
Assure the Underlay, Overlay, and Anything In-between
RECOMMENDATIONS
CSPs continue to invest in solutions that shed the drawbacks associated with statically defined and isolated point solutions. For example, Deutsche Telekom highlights that 70% of tickets in its operations are automatically dispatched. Vodafone reports that it has achieved savings of US$533 million from automating back-office functions. There is strong market demand for automation offerings that increase operational efficiency, cover end-to-end root-cause analysis across domains and stack layers, and implement self-healing and orchestration actions. Vendors are responding to that market demand. For example, Accedian and Infovista, among other vendors, embrace DevOps practices and tools for a low-footprint, resource-efficient assurance offering. When a new service or NF is launched, their dynamic assurance offerings automatically monitor it. RADCOM and Netcracker, on the other hand, offer host-based and pod-based tapping and filtering using extended Berkeley Packet Filter (eBPF) technology to enable cloud-native tapping for large-scale, cloudified telco networks.
Increasingly, cloudified communications networks will be characterized by network topologies with their own life span and unique lifecycle states, all of which will need their own assurance processing rules. Further, with the increasing “softwarization” of networks, security flaws are on the rise (see ABI Insight “5G Security: Simplicity and Risk Management Key Constructs for Growth”). So, security metrics must become an indispensable part of Service-Level Agreements (SLAs), especially for Business-to-Business (B2B) services using public clouds. Infovista and RADCOM should take the security and mobility dimensions into account as they implement cloud-native automation across and within every phase of a CSP’s network lifecycle. Assurance and policy management mechanisms must provide fully correlated visibility across multiple network domains (i.e., the Radio Access Network (RAN), transport, core, data centers), across multiple layers (network, services, devices), and within a (single- or multi-vendor) cloud stack environment, including underlay and overlay. This is no small feat in a complex and fragmented ecosystem, which is further complicated by diverse and transient network topologies.
To conclude, as discussed in this ABI Insight, assuring non-static, non-uniform, and high-mobility networks cannot be performed in the traditional way. Accedian and RADCOM should continue to develop built-in Artificial Intelligence (AI)/Machine Learning (ML) capabilities to help CSPs manage growing complexity, proactively reduce Capital Expenditure (CAPEX) and Operational Expenditure (OPEX), and monetize their next-generation software-centric networks. Their solutions, and those of other vendors, must continue to build on AI/ML-driven analytics that help fine-tune automation frameworks and process the notifications generated from them. That must be done in conjunction with overlay orchestration (Quality of Experience (QoE) app monitoring), underlay resources, and policy management to authorize and enforce a corrective action, often proactively. As security becomes a key dimension in supplier selection, assurance subsystems (e.g., policy management) should be made aware of security rules (e.g., safe placement, Access Control Lists (ACLs), cloud policies), which are different for every cloud provider.