The Architecture and Design of the Network

Part 7

The Operational Environment

Geoff Huston, July 1994

Copyright (c) Geoff Huston, 1994


Operation of a Service

      - Service Quality is achieved by a match of capability to demand:
            technical capability to carry user load
            financial capability to provide adequate resource

In terms of overall operational management of the network service, the quality of the service depends on two capabilities: the technical capability to carry end user traffic without significant degradation of, or high variability in, end to end performance, and the financial capability of the network service provider to fund the acquisition of transmission and switching infrastructure to match demand levels.

Such considerations impact directly on network funding models, where there is a requirement to ensure that additional demand is accompanied by additional financial capability to deploy resources to match the demand level.

It is not proposed to describe various financial structures for network services within the scope of this document, other than to note that the financial structure must match both the requirement of the service itself to scale the resource investment to match demand levels and, to some extent, the expectations of the client base. It must be noted that this is an area where there is considerable diversity in the current Internet environment, and there is no commonly agreed business model at this stage. The basic pricing mechanism is that pricing should reflect the requirement of the service provider to meet the costs associated with the service. The precise nature of these costs, and how they are derived from the underlying resource costs, are areas yet to be well understood across the Internet community. A number of papers which examine the economics of the Internet can be found at <this reference>.


Network Operation

      - Management of IP numbers is critically important:
            Ensure network number registration information is accurate
            Publish correct IP numbers to external network peers
            Ensure that correct IP numbers are routed
            Ensure that end clients are using correctly allocated numbers

The management of IP network numbers is critically important to the network provider, both in terms of management of routing advertisements and in terms of maintenance of data sets which describe each client's networks and any applicable policies imposed by the client on their networks.

When a client requests that a network number be routed, it is important that an independent IP network number registry be able to verify that the number was originally allocated by a registry, that the number has been uniquely allocated, and that the client is the entity in receipt of this allocation.

Considerable care must be taken that external peer connections advertise an accurate collection of network numbers. Where explicit network numbers and routes are imported via an external connection, a periodic audit of such network numbers against the registry is advised, to ensure that the provider is exchanging numbers in accordance with the end client's expressed policies, and that the numbers form part of the globally unique allocated pool.
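
The periodic audit described above can be sketched as a comparison of imported network numbers against registry records. This is a minimal illustration only: the registry extract, the client names and the function name audit_imported_routes are hypothetical, and the network numbers are taken from ranges reserved for documentation.

```python
import ipaddress

def audit_imported_routes(imported, registry):
    """Split imported network numbers into registered and unregistered sets.

    imported: network strings learned from an external peer.
    registry: dict mapping network string -> registered holder
              (hypothetical registry extract).
    """
    allocated = {ipaddress.ip_network(n): holder for n, holder in registry.items()}
    verified, unverified = {}, []
    for text in imported:
        net = ipaddress.ip_network(text)
        if net in allocated:
            verified[text] = allocated[net]   # number is known to the registry
        else:
            unverified.append(text)           # needs follow-up before routing
    return verified, unverified

# Hypothetical registry extract and peer advertisement.
registry = {"192.0.2.0/24": "Client A", "198.51.100.0/24": "Client B"}
imported = ["192.0.2.0/24", "203.0.113.0/24"]
verified, unverified = audit_imported_routes(imported, registry)
print(verified)
print(unverified)
```

Any entry in the unverified list would be a candidate for query with the peer before the route is accepted.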

Careful, accurate attention to network number management and number handling is essential for the Internet. As there is no national or provider prefix, there is no unique "handle" which automatically identifies an address with a location or provider. The interests of the providers as a collection of service entities are best served by interoperability, which in turn requires careful attention to ensure that all numbers are derived from the single globally unique Internet address pool.


Operational Management

      - All active elements of the network centrally managed
      - SNMP used as platform for management
      - routers are the central component of operations

Operational management from the Network Service Provider's perspective is concentrated in the activity of management of the routers, and through this, management of the transmission resource and management of the routing domain.

It is critical to manage the service environment through a single management structure, undertaking this activity on a centralised basis. The structure used in most Internet Service environments is to use SNMP as a router monitoring tool, and use a router configuration database management system to manage router configurations.


Operational Management

      - snmp traps used for exception reporting
      - never underestimate the power of ping !
      - traceroute - the route reporter
      - dig - DNS diagnosis

The typical operational environment is that of continual periodic polling of routers via SNMP to obtain desired interface metrics relating to service quality and usage levels, and the use of exception reporting based on both SNMP traps and SNMP poll failure.
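
Exception reporting on poll failure might be sketched as below. The SNMP query itself is deliberately left out: poll is any function supplied by the caller, and fake_poll is a hypothetical stand-in. Raising an alarm only after two consecutive failures is an assumption made for the illustration, not a recommendation from the text.

```python
def poll_round(routers, poll, failures, threshold=2):
    """Poll every router once.

    Returns the routers which have just reached `threshold`
    consecutive poll failures. `failures` carries the running
    failure count per router between rounds.
    """
    alarms = []
    for name in routers:
        try:
            poll(name)
            failures[name] = 0            # a good poll clears the count
        except TimeoutError:
            failures[name] = failures.get(name, 0) + 1
            if failures[name] == threshold:
                alarms.append(name)       # report once, at the threshold
    return alarms

# Hypothetical stand-in for a real SNMP query: router "r2" never answers.
def fake_poll(name):
    if name == "r2":
        raise TimeoutError(name)
    return {"status": "ok"}

failures = {}
first = poll_round(["r1", "r2"], fake_poll, failures)
second = poll_round(["r1", "r2"], fake_poll, failures)
print(first, second)
```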

When either of these mechanisms indicates element failure within the network domain, the two most commonly used initial diagnosis tools are ping and traceroute. Ping is probably the most basic reachability tool, indicating whether a device is responding or not. Traceroute is very useful in indicating the path taken to reach a device, providing immediate notification of routing states and, as traceroute reports trip times, providing notification of current usage levels along the path to the nominated device.
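
As an illustration of reading trip times from traceroute, the sketch below finds the hop at which the best trip time rises most sharply, which is often the first place to look for a loaded link. The sample output, host names and function names are hypothetical, and the line layout is an assumption, as traceroute output varies between implementations.

```python
import re

# Hypothetical traceroute output, using documentation addresses.
SAMPLE = """\
 1  gw.example.net (192.0.2.1)  1.2 ms  1.1 ms  1.3 ms
 2  core.example.net (198.51.100.1)  2.0 ms  2.2 ms  2.1 ms
 3  peer.example.org (203.0.113.1)  48.5 ms  51.0 ms  49.7 ms
 4  * * *
"""

def hop_times(output):
    """Return (hop number, best trip time in ms or None) for each hop line."""
    hops = []
    for line in output.splitlines():
        fields = line.split()
        if not fields or not fields[0].isdigit():
            continue                      # skip headers and blank lines
        times = [float(t) for t in re.findall(r"([\d.]+) ms", line)]
        hops.append((int(fields[0]), min(times) if times else None))
    return hops

def largest_jump(hops):
    """Return the hop number at which the best trip time rises the most."""
    answered = [(n, t) for n, t in hops if t is not None]
    jumps = [(b[1] - a[1], b[0]) for a, b in zip(answered, answered[1:])]
    return max(jumps)[1] if jumps else None

hops = hop_times(SAMPLE)
print(hops)
print(largest_jump(hops))
```

A hop which does not answer is recorded with a time of None and is ignored when looking for the jump.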

As well as component failure, a common operational concern is the operation of the Domain Name System (DNS), and here the dig tool is about the most useful in examining the nature of any DNS problem.


Operational Management

      - Each management environment has particular requirements
      - Routers are the most reliable network element
      - carrier services are the greatest point of vulnerability
      - careful router configuration will isolate LAN faults

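It is stressed that there is no single management methodology for Internet Service Operators. Each management environment will have particular requirements, and the operational structure must be responsive to such requirements. Generally there are three areas of potential failure which will impact on the operational environment: the routers, which are typically the most reliable network element; the carrier services, which are the greatest point of vulnerability; and the attached LANs, whose faults can be isolated by careful router configuration.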


Operational Management

      - Internet issues - working within a larger multi-provider environment:
            NOC obligations
            trouble ticket management

The Internet is a multi-provider environment, and problems that occur within the service provider's domain may be noted by other users, and other providers, across the Internet. It is essential to operate a Network Operations Centre with 24 hour contact points, and to be able to accept other providers' trouble tickets and report back on the resolution of these tickets.


Reporting

      - Goals of data collections and reporting:
             operational management
             trend analysis of traffic volumes
             monitor levels of delivered service
             monitor usage patterns
             marketing material!

Reporting on the network service is again an area which is highly variable from operator to operator, as reporting structures are very much attuned to local requirements.

In crafting a reporting structure it is important initially to identify why the report is being produced, and then be able to ascertain the strict reporting requirement based on the ultimate use of each set of figures.

A number of reporting sets are used exclusively for operational management. Reports indicating link layer errors are used in determining link quality drifts and likely outage events. Link utilisation figures are also useful in the operational management environment, in terms of being able to diagnose performance issues and to respond to client reports on network performance and availability.

Planning requires trend analysis reports of network usage levels, indicating the likely near term traffic patterns, based on projections of historical time series data. Such data is very useful for planning network capacity upgrades and reengineering the internal network topology to match the likely future traffic patterns.

Some figures are more general in nature and are used in a marketing environment to indicate the levels of uptake of the service, or the overall quality of the delivered service.

Other sets are used within the context of management reports to indicate the effectiveness of the service and the parameters of usage of the service.

Within each environment the requirements for reporting will differ. The requirement within the area of network design is to select those network metrics which will be gathered, analysed and used as the basis of network reports. There is little point in collecting every possible metric within a network environment. The consequent size of the data set is itself a major operational problem and subsequent analysis of the data to formulate coherent reports is a close to impossible task.


Reporting

      - Balance of cost of data collection and analysis against
            benefit of resultant data sets
      - Data collection points affect ability to gather data

Each data collection set is not without cost, and the issue within the design of the reporting structure is to balance the number of collected data sets against the number and type of network reports which will be produced. Some data sets are not readily gathered by routers, and particular attention does need to be paid to the underlying network architecture if certain data components are essential reporting elements.

The most difficult data to collect is network usage by source / destination address pair, reporting, for example, on the data transfer volume, time of day and initiator. While such data has an obvious application in terms of imposing some form of incremental pricing on network usage, most current routers are incapable of reliably gathering such a quantity of data within a large network switching point. Packet sniffing techniques suffer both from limits on overall packet throughput and from the consequent design issue of "breaking apart" a router switching point to pass all traffic across a wire element to allow packet header data collection.


Reporting

      - Routers:
            Interface volumes
            Line errors
            Routing tables
            Router resource use

Router data sets which are of the greatest use to the operations area include interface volumes, line errors, routing tables and router resource use.


Reporting

      - nnstat - ethernet monitoring with a dedicated host
            gather packet header information
            source - destination volumes
            application generated volumes
            highly flexible data gathering ability
            expensive to deploy!

As noted above current routers are typically very poor in collecting data held within the headers of the switched packets. Typically such data is gathered using a technique of packet spying on a broadcast medium (such as an ethernet or FDDI ring) where every packet is passed to a data collection process within a dedicated host system. Packet header information does allow the generation of end to end flow data, application-generated data and similar.
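
The kind of aggregation a dedicated collection host performs on packet headers can be sketched as follows. This is not the nnstat interface: the record layout (source, destination, port, length), the sample records and the function name flow_volumes are all hypothetical.

```python
from collections import Counter

def flow_volumes(headers):
    """Sum octets per (source, destination) pair and per application port.

    headers: iterable of (source, destination, port, length) records,
             one per observed packet (hypothetical layout).
    """
    by_pair, by_port = Counter(), Counter()
    for src, dst, port, length in headers:
        by_pair[(src, dst)] += length   # end to end flow volume
        by_port[port] += length         # application generated volume
    return by_pair, by_port

# Hypothetical packet header records, using documentation addresses.
headers = [
    ("192.0.2.10", "198.51.100.5", 25, 1200),
    ("192.0.2.10", "198.51.100.5", 25, 800),
    ("192.0.2.11", "203.0.113.9", 21, 1500),
]
by_pair, by_port = flow_volumes(headers)
print(by_pair.most_common())
print(by_port.most_common())
```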


Example Data Collection

      - Routers
            15 minute interface volumes and error count
      - nnstat
            deployed at network peer boundary

One potential configuration is to use router interface statistics (collected via SNMP polling) at 15 minute intervals, to collect interface volumes, error counts and queue drops. These figures can be processed to give a time series indicating carrier stability (error counts), burst congestion severity (queue drops) and link utilisation (interface volumes). Each interface must be treated as two independent half duplex interfaces in this context.

Such data collection can be used across the entire network to produce link-based traffic information. The derived information can be processed to give a rough approximation of end to end flow information, but such approximations can only be made with a large set of assumptions.


Network Reports

As noted in the section on internal infrastructure, datagram systems with reliable end to end protocols are very sensitive to congestion events, and to track network performance it is necessary to monitor the load of links at relatively short intervals on a continuous basis as the means of determining the frequency and severity of congestion events and monitoring average line loading levels.

Some work by the author, and similar efforts noted by the Operational Statistics Working Group of the IETF, found that in general 15 minute sampling periods were sufficiently small to give a good indication of such events while not entering areas of data overload. The technique generally used is to poll the interface statistics of each router every 15 minutes, and then post-process this data to produce link utilisation figures using the delta of these values for each measurement interval.
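
The post-processing step can be sketched as below for one direction of a link. It assumes the polled values are 32-bit octet counters (such as the MIB-II ifInOctets and ifOutOctets objects) which wrap to zero, and that the counter wraps at most once in a measurement interval. The function name and the 64 kbit/s line in the example are hypothetical.

```python
COUNTER_MODULUS = 2 ** 32   # assumes 32-bit octet counters which wrap to zero

def link_utilisation(prev_octets, curr_octets, interval_seconds, line_bps):
    """Return average utilisation (0.0 to 1.0) for one direction of a link.

    The modulo arithmetic tolerates a single counter wrap between
    the two polls; more than one wrap cannot be detected.
    """
    delta = (curr_octets - prev_octets) % COUNTER_MODULUS
    return (delta * 8) / (interval_seconds * line_bps)

# Hypothetical 64 kbit/s line polled 15 minutes (900 seconds) apart.
plain = link_utilisation(1_000_000, 4_600_000, 900, 64_000)
wrapped = link_utilisation(2 ** 32 - 1_800_000, 1_800_000, 900, 64_000)
print(plain, wrapped)
```

The same calculation is applied separately to the inbound and outbound counters, so that each direction of the link yields its own utilisation series.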


Weekly Link Report

      - weekly report of 15 minute link load levels

Shown here is a sample weekly report used as a network link utilisation report. For each link the traffic levels in each direction are averaged across 15 minute periods, and the average traffic load for each period is plotted.

This graph is used to quickly identify traffic bottlenecks as they occur on particular links, and also provide indication of congestion points as they occur.

The report also includes a line "signature" indicating the number of data points which occur at line loadings of 5%, 10% and so on. This signature is used as an indicator of total load, as once a number of samples occur at load levels of over 75% the line can be considered a peak traffic congestion point.
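
A line signature of this kind can be sketched as a count of samples per 5% load bucket. The text does not say how many samples above 75% mark a congestion point, so the min_samples value below is an assumption, as are the function names and the sample loads.

```python
def line_signature(loads, bucket=5):
    """Count samples per load bucket; loads are percentages from 0 to 100."""
    buckets = [0] * (100 // bucket)
    for load in loads:
        # A load of exactly 100% is counted in the top bucket.
        index = min(int(load // bucket), len(buckets) - 1)
        buckets[index] += 1
    return buckets

def is_congestion_point(loads, threshold=75, min_samples=4):
    """True once at least `min_samples` samples exceed `threshold` percent."""
    return sum(1 for load in loads if load > threshold) >= min_samples

# Hypothetical 15 minute load samples for one direction of a link.
loads = [3, 7, 12, 80, 85, 90, 100]
signature = line_signature(loads)
print(signature)
print(is_congestion_point(loads))
```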


Network Reports

      - monthly report
      - quarterly trend reports and projections

Longer term reports are typically useful in terms of providing longer term growth projections, and can be of assistance in forecasting trend levels for capacity.

Generally bandwidth requires some months of lead time on ordering, particularly when the capacity is at the megabit level, and the process of funding such large expenditure items often must commence years in advance of the actual time of requirement.

It is therefore essential to be able to analyse usage figures to the level of being able to deduce large scale trends in traffic growth, and to be able to extrapolate, with a suitable level of approximation, the stage in the future at which additional bandwidth (and additional levels of expenditure) will be required to ensure that the network operates outside of areas of catastrophic congestion-induced collapse and does not operate in an administrative or system-induced denial of service mode.

It is therefore essential to be able to generate forward usage projections and map these projections into both future network capacity requirements and future network funding requirements, with a suitably large period of forward notice which matches the managerial process associated with the network.
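
A forward projection of this kind might be sketched as a straight line fitted to historical monthly load figures and extended to the point where it crosses a planning limit. The straight line model is an assumption made for simplicity; real traffic growth may be far from linear. The monthly figures, the 75% limit and the function names are hypothetical.

```python
def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...

    Returns (slope, intercept). Needs at least two values.
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def months_until(values, limit):
    """Months after the last sample at which the fitted line reaches `limit`.

    Returns None when the fitted trend is flat or falling.
    """
    slope, intercept = fit_line(values)
    if slope <= 0:
        return None
    crossing = (limit - intercept) / slope
    return max(0.0, crossing - (len(values) - 1))

# Hypothetical monthly peak load on a link, as a percentage of capacity.
monthly_load = [20, 25, 30, 35, 40, 45]
lead_time = months_until(monthly_load, 75)
print(lead_time)
```

The projected figure can then be compared against the ordering lead time and the funding cycle to decide when the upgrade process must begin.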

