Internet Draft
Document: <draft-zseby-packet-selection-00.txt>                T. Zseby
Category: Experimental                                 Fraunhofer FOKUS
                                                            August 2002


                Sampling Techniques for Packet Selection


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as Internet-
   Drafts. Internet-Drafts are draft documents valid for a maximum of
   six months and may be updated, replaced, or obsoleted by other
   documents at any time. It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   This document describes the deployment of sampling techniques for
   packet selection. It suggests some terminology and shows different
   sampling methods for selecting a subset of packets in a flow.
   Furthermore it describes which parameters can be varied for the
   different sampling methods.


Zseby                       Expires February 2003            [Page 1]

Internet Draft  Sampling Techniques for Packet Selection    August 2002


Table of Contents

   1.   Introduction.................................................2
   2.   Terminology..................................................3
   3.   Deployment of Sampling Techniques for packet selection.......3
   4.   Sampling Methods.............................................4
   4.1  Sampling Algorithm...........................................4
   4.1.1Systematic Sampling..........................................4
   4.1.2Random Sampling..............................................5
   4.1.3Stratified Sampling..........................................5
   4.2  Sampling Frequency and Interval-Length.......................6
   4.2.1Count-based Trigger..........................................6
   4.2.2Time-based Trigger...........................................6
   4.2.3Packet-content-based Trigger.................................6
   5.   Sampling Parameters..........................................7
   5.1  Parameters for systematic sampling...........................7
   5.2  Parameters for random sampling...............................8
   5.3  Parameters for stratified sampling...........................8
   6.   Security Considerations......................................8
   7.   References...................................................8
   8.   Author's Addresses...........................................9
   9.   Full Copyright Statement.....................................9

1. Introduction

   Increasing data rates and growing measurement demands increase the
   requirements for data collection resources. For measurement
   scenarios in backbone networks it is often required to measure whole
   traffic aggregates instead of single flows. Furthermore some
   measurement methods require the capturing of packet headers or even
   parts of the payload. All this can lead to an overwhelming amount of
   measurement data, resulting in high demands regarding resources for
   storage, transport and post processing.

   In some cases specialized hardware helps to fulfill these demands
   but on the other hand increases the costs for providing the
   measurement. Since measurements are mainly a supporting
   functionality for the service provisioning, measurement costs
   usually should be limited to a small fraction of the costs of the
   network service provisioning itself. Therefore a reduction of the
   measurement result data is crucial to prevent the depletion of the
   available (i.e. the affordable) resources.  Such a reduction can be
   achieved by a reasonable deployment of sampling techniques. Sampling
   helps to prevent an exhaustion of resources and to limit the
   measurement costs. Examples for applications that benefit from
   sampling are given in [DuGG02].

   This document concentrates on the deployment of sampling techniques
   for packet selection. That means selecting a subset of packets in a
   flow. Sampling can be also used to select a set of flows out of all
   flows on the link or a set of observation points out of all
   observation points in the network. This is not addressed in this


Zseby                       Expires February 2003            [Page 2]

Internet Draft  Sampling Techniques for Packet Selection    August 2002


   document. A framework for passive packet measurement and further
   packet selection methods can be found in [DuGG02].

2. Terminology

   Flow
        see definition in [QuZC02]

   Metering process
        see definition in [QuZC02]

   Sample size
        The sample size denotes the number of element in the sample.

   Sampling (definition from [QuZC02])
        Sampling describes the systematic or random selection of a
        subset of elements (the sample) out of a set of elements (the
        parent population). Usually the purpose of applying sampling
        techniques is to estimate a parameter of the parent population
        by using only the elements of the subset. Sampling techniques
        can be applied for instance to select a subset of packets out
        of all packets of a flow or to select a subset of flows out of
        all flows on a link.  Sampling methods differ in their sampling
        strategy (e.g. systematic or random) and in the event that
        triggers the selection of an element. The selection of one
        packet can for instance be triggered by its arrival time (time-
        based sampling), by its position in the flow (count-based
        sampling) or by the packet content (content-based sampling)
        [QuZC02].

   Sampling function
        Function that determines whether an element is selected or not

   Selection interval
        Interval (specified in number of packets or as time duration)
        in which all packets are selected.

   Selection probability
        The probability with which one element is selected as part of
        the sample.

3. Deployment of Sampling Techniques for packet selection

   The deployment of sampling techniques aims at the provisioning of
   information about a specific characteristic of the parent population
   at a lower cost than a full census would demand. In order to plan a
   suitable sampling strategy it is therefore crucial to determine the
   needed type of information and the desired degree of accuracy in
   advance.

   First of all it is important to know the type of metric that should
   be estimated. The metric of interest can range from simple packet


Zseby                       Expires February 2003            [Page 3]

Internet Draft  Sampling Techniques for Packet Selection    August 2002


   counts [JePP92] up to the estimation of whole distributions of flow
   characteristics [ClPB93].

   Secondly, the required accuracy of the information and with this,
   the confidence that is aimed at, should be known in advance. For
   instance for usage-based accounting the required confidence for the
   estimation of packet counters can depend on the monetary value that
   corresponds to the transfer of one packet. That means that a higher
   confidence could be required for expensive packet flows (e.g.
   premium IP service) than for cheaper flows (e.g. best effort). The
   accuracy requirements for validating a previously agreed quality can
   also vary extremely with the customer demands. These requirements
   are usually determined by the service level agreement (SLA).

   Sampling is considered as part of the metering process. It can be
   applied at different functions of the metering process (e.g. during
   packet header capturing, before or after classification, etc.). In
   the following we consider a measured IP packets with its observation
   point and timestamp as basis elements of the parent population. And
   all packets in the flow of interest as the parent population. Please
   note that with the IPFIX flow definition the flow of interest can
   also include all packets on the link.

   The sampling method and the parameters in use must be clearly
   communicated to all applications that use the measurement data. Only
   with this a correct interpretation of the measurement results  can
   be ensured.


4. Sampling Methods

   Sampling Methods can be characterized by the sampling algorithm, the
   trigger type used for starting a sampling interval and the length of
   the sampling interval. These parameters are described here in
   detail.

4.1 Sampling Algorithm

   The sampling algorithm describes the basic process for selection of
   samples. In accordance to [AmCa89] and [ClPB93] we define the
   following basic sampling processes:

4.1.1   Systematic Sampling

   Systematic sampling describes the process of selecting the starting
   points and the duration of the selection intervals according to a
   deterministic function. This can be for instance the periodic
   selection of every n-th element of a trace but also the selection of
   all packets that arrive at pre-defined points in time. Even if the
   selection process does not follow a periodic function (e.g. if the
   time between the sampling intervals varies over time) we consider
   this as systematic sampling as long as the selection is
   deterministic. The use of systematic sampling always involves the

Zseby                       Expires February 2003            [Page 4]

Internet Draft  Sampling Techniques for Packet Selection    August 2002


   risk of biasing the results. If the systematics in the sampling
   process resembles systematics in the observed stochastic process
   (occurrence of the characteristic of interest in the network), there
   is a high probability that the estimation will be biased. In this
   context it also has to be considered that there might be systematics
   (e.g. periodic repetition of an event) in the observed process which
   one might not be aware of in advance.

4.1.2   Random Sampling

   Random sampling selects the starting points of the sampling
   intervals in accordance to a random process. The selection of
   elements are independent experiments. With this, unbiased
   estimations can be achieved. In contrast to systematic sampling,
   random sampling requires the generation of random numbers. One can
   differentiate two methods of random sampling:

   n-out-of-N sampling

   In n-out-of-N sampling n elements are selected out of the parent
   population that consists of N elements. One example would be to
   generate random numbers and select all packets which have a packet
   position equal to one of the random numbers. For this kind of
   sampling the sample size is fixed.

   Probabilistic sampling (see also [DuGG02])

   In probabilistic sampling the decision whether an element is
   selected or not is made in accordance to a pre-defined selection
   probability. An example would be to flip a coin for each packet and
   select all packets for which the coin showed the head. For this kind
   of sampling the sample size can vary for different trials. The
   selection probability is not necessarily the same for each packet
   and can depend on other parameters (e.g. the packet content)
   [DuGG02].

4.1.3   Stratified Sampling

   The basic idea behind stratified sampling is to increase the
   estimation accuracy by using a-priori information. The a-priori
   information is used to perform an intelligent grouping of the
   elements of the parent population. With this a higher estimation
   accuracy can be achieved with the same sample size.

   Stratified sampling divides the sampling process into multiple
   steps. First, the elements of the parent population are grouped into
   subsets in accordance to a given characteristic. This grouping can
   be done in multiple steps. Then samples are taken from each subset.

   The stronger the correlation between the characteristic used to
   divide the parent population and the characteristic of interest (for
   which an estimate is sought after), the easier is the consecutive
   sampling process and the higher is the stratification gain. For

Zseby                       Expires February 2003            [Page 5]

Internet Draft  Sampling Techniques for Packet Selection    August 2002


   instance if the dividing characteristic were equal to the
   investigated characteristic, each element of the sub-group would be
   a perfect representative of that characteristic. In this case it
   would be sufficient to take one arbitrary element out of each
   subgroup to get the actual distribution of the characteristic in the
   parent population. Therefore stratified sampling can reduce the
   costs for the sampling process (i.e. the number of samples needed to
   achieve a given level of confidence).

4.2 Sampling Frequency and Interval-Length

   According to [AmCa89] and [ClPB93] we differentiate sampling
   techniques by the event that triggers the sampling process. The
   trigger determines what kind of event starts and stops the sampling
   intervals. With this the sampling frequency and the length of the
   sampling interval (measured in packets or time) is determined. It is
   also possible to combine start and stop triggers of different types
   (e.g. start a 10 s measurement interval every n-th packet).
   Nevertheless, due to the unknown relation between number of packets
   and duration of an interval this can lead to unexpected overlapping
   of sampling intervals. We distinguish the following techniques:

4.2.1   Count-based Trigger

   With this method the packet count triggers the start and stop of a
   sampling interval. One example is the systematic sampling of every
   n-th packet of a specific type. For count-based sampling it is
   necessary to integrate a packet counter into the meter. Since non-
   intrusive measurements are based on the traffic in the network only,
   the time that it takes until the n packets of a specific type are
   seen by the probe is unknown. This means the duration of the
   sampling process is undetermined (and can be infinite) if the
   sampling goal requires a minimum sampling size (number of packets).

4.2.2   Time-based Trigger

   In time-based sampling the arrival time of a packet at the meter
   determines whether this packet is captured or not. One example is to
   capture packets every 30 seconds. If the stop trigger is also a
   point in time the sampling interval length is given as the time
   duration between this two points. Since it is unknown how many
   packets arrive in a specific time interval the number of packets
   captured with this technique is unknown (and can be zero). This has
   to be taken into account if a minimum sampling size is required.

4.2.3   Packet-content-based Trigger

   With this method the content (or parts of the content) of the packet
   itself (header, payload or both) triggers the sampling process. This
   can be achieved by direct comparison of parts of the packet with a
   reference pattern [CoGi98] or by matching the result of a function
   performed on packet content [DuGr00].


Zseby                       Expires February 2003            [Page 6]

Internet Draft  Sampling Techniques for Packet Selection    August 2002


5. Sampling Parameters

   The decision whether to select a packet or not is based on a
   function which takes packet properties and sampling parameters as
   input. The sampling parameters usually remain the same for the
   sampling process and are pre-defined by the administrator. A special
   case are sampling parameters that depend on packet properties (e.g.
   selection probability dependent on packet content). In such cases
   only the function which describes the dependency is fixed in
   advance. Packet properties are examined per packet and are only
   available after the packet has arrived at the meter.

                        sampling parameters
                                |
                                V
                       +-------------------+
   packet properties   | sampling function |
   ------------------->|                   |----> selected/not selected
                       +-------------------+


   Selection decision = f(sampling parameters, packet properties)


   Packet properties are: packet position, arrival time, packet content
   (header fields, parts of payload) and observation point.

   Which packet properties are used as input for the sampling function
   is determined by the used sampling algorithm. For count based
   sampling the packet position is used as input. For time-based
   sampling the arrival time and for content-based sampling (parts of)
   the packet content (e.g. header fields). It is also possible, that
   the algorithm differs with regard to the observation point on which
   the packet was observed. The sampling parameters differ for the
   different sampling technique.

5.1 Parameters for systematic sampling

   For systematic sampling the deterministic function which is used for
   the packet selection needs to be given. For periodic sampling the
   start of the first selection interval, the length of the selection
   interval (given in number of packets or as time duration) and the
   spacing between selection intervals needs to be specified.

                  <-- interval length = 7 --> <-- spacing = 5 -->
   Paket position: 1   2   3   4   5   6   7   8   9  10   11  12  13..

   In sample: 1,2,3,4,5,6,7, 13,...

   Selecting every x-th packet would be a special case with selection
   interval=1 and spacing=x-1.


Zseby                       Expires February 2003            [Page 7]

Internet Draft  Sampling Techniques for Packet Selection    August 2002


5.2 Parameters for random sampling

   For random n-out-of-N sampling only the sample size n needs to be
   specified. This can be done either as an absolute number or as
   fraction of the parent population n/N.

   For probabilistic sampling the selection probability p needs to be
   specified. If the selection probability depends on other parameters
   (e.g. packet content), the function that expresses this dependency
   has to be specified.

5.3 Parameters for stratified sampling

   For stratified sampling one has to specify classification rules for
   grouping the elements into subgroups and the sampling scheme that is
   used within the subgroups. For the sampling scheme within the
   subgroups the parameters have to be specified as described above.

6. Security Considerations

   Security threats can occur if the configuration of sampling
   parameters or the communication of sampling parameters to the
   application is corrupted. This document only describes sampling
   schemes that can be used for packet selection. It neither describes
   a mechanism how those parameters are configured nor how these
   parameters are communicated to the application. Therefore the
   security threats that originate from this kind of communication
   cannot be assessed with the information given in this document.

7. References

   [AmCa89]    Paul D. Amer, Lillian N. Cassel: Management of Sampled
               Real-Time Network Measurements, 14th Conference on Local
               Computer Networks, October 1989, Minneapolis, pages 62-
               68, IEEE, 1989

   [ClPB93]    K.C. Claffy, George C Polyzos, Hans-Werner Braun:
               Application of Sampling Methodologies to Network Traffic
               Characterization, Proceedings of ACM SIGCOMM'93, San
               Francisco, CA, USA, September 13 - 17, 1993

   [CoGi98]    I. Cozzani, S. Giordano: Traffic Sampling Methods for
               end-to-end QoS Evaluation in Large Heterogeneous
               Networks. Computer Networks and ISDN Systems, 30 (16-
               18), September 1998.

   [DuGG02]    Nick Duffield, Albert Greenberg, Matthias Grossglauser,
               Jennifer Rexford: A Framework for Passive Packet
               Measurement, Internet Draft draft-duffield-framework-
               papame-01, work in progress, February 2002

   [DuGr00]    Nick Duffield, Matthias Grossglauser: Trajectory
               Sampling for Direct Traffic Observation, Proceedings of

Zseby                       Expires February 2003            [Page 8]

Internet Draft  Sampling Techniques for Packet Selection    August 2002


               ACM SIGCOMM 2000, Stockholm, Sweden, August 28 -
               September 1, 2000.

   [JePP92]    Jonathan Jedwab, Peter Phaal, Bob Pinna: Traffic
               Estimation for the Largest Sources on a Network, Using
               Packet Sampling with Limited Storage, HP technical
               report, Managemenr, Mathematics and Security Department,
               HP Laboratories, Bristol, March 1992,
               h
                ttp://www.hpl.hp.com/techreports/92/HPL-92-35.htm 
                                                                 l

   [QuZC02]    J. Quittek, T. Zseby, B. Claise, S. Zander, G. Carle,
               K.C. Norseth: Requirements for IP Flow Information
               Export, Internet Draft <draft-ietf-ipfix-reqs-05.txt>,
               work in progress, August 2002

   [Zseb02]    Tanja Zseby: Deployment of Sampling Methods for SLA
               Validation with Non-Intrusive Measurements, Proceedings
               of Passive and Active Measurement Workshop (PAM 2002),
               Fort Collins, CO, USA, March 25-26, 2002

8. Author's Addresses

   Tanja Zseby
   Fraunhofer Institute for Open Communication Systems
   Kaiserin-Augusta-Allee 31
   10589 Berlin
   Germany
   Phone: +49-30-34 63 7153
   Fax:   +49-30-34 53 8153
   Email: zseby@fokus.fhg.de

9. Full Copyright Statement

   Copyright (C) The Internet Society (2002). All Rights Reserved. This
   document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an


Zseby                       Expires February 2003            [Page 9]

Internet Draft  Sampling Techniques for Packet Selection    August 2002


   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Zseby                       Expires February 2003            [Page 10]