Internet Draft Document: T. Zseby Expires: December 2003 Fraunhofer FOKUS M. Molina NEC Europe Ltd. F. Raspall NEC Europe Ltd. Nick Duffield AT&T Labs - Research June 2003 Sampling and Filtering Techniques for IP Packet Selection Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This document describes sampling and filtering techniques for IP packet selection. It introduces information models for packet sampling, for packet filtering and for combinations of methods. The information models describe what information has to be specified in order to describe the method. This information is used for configuring the selection technique in measurement processes and for reporting the technique in use to the measurement data collection process. The document first suggests some terminology, then it describes in detail packet sampling and packet filtering techniques and their parameters. It also describes how these two techniques can be combined to build more elaborate packet selectors. Finally, it introduces information models for the most common sampling and filtering techniques. Table of Contents 1. Introduction.................................................2 2. Terminology..................................................2 3. Scope and Deployment of Packet Selection Techniques..........4 3.1 Sampling.....................................................5 Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 1] Internet Draft Techniques for IP Packet Selection June 2003 3.1.1 Systematic Sampling.........................................5 3.1.2 Random Sampling.............................................6 3.1.2.1 n-out-of-N sampling.......................................6 3.1.2.2 Probabilistic sampling....................................6 3.1.2.2.1 Uniform Probabilistic Sampling..........................7 3.1.2.2.2 Non-Uniform Probabilistic Sampling......................7 3.1.2.2.3 Non-Uniform flow State dependent sampling...............7 4. Filtering....................................................7 4.1 Mask/Match filtering.........................................8 4.2 Hashing filtering............................................8 4.2.1 Random sampling emulation...................................9 4.2.2 Consistent packet selection and its applications............9 4.2.3 Guarding Against Pitfalls and Vulnerabilities..............10 4.3 Router State filtering......................................10 5. Input Parameters and Information Models.....................11 5.1 Information Model for Sampling Techniques...................11 5.2 Information Model for Filtering Techniques..................13 6. Composite Techniques........................................15 6.1 Cascaded filtering->sampling or sampling->filtering.........15 6.2 Stratified Sampling.........................................15 7. Security Considerations.....................................16 8. References..................................................16 9. Author's Addresses..........................................17 10. Intellectual Property Statement.............................18 11. Full Copyright Statement....................................18 1. Introduction Increasing data rates and growing measurement demands increase the requirements for data collection resources. For measurement scenarios in backbone networks it is often required to measure whole traffic aggregates instead of single flows. Furthermore, some measurement methods require the capturing of packet headers or even parts of the payload. All this can lead to an overwhelming amount of measurement data, resulting in high demands regarding resources for metering, storage, transport and post processing. In some cases specialized hardware helps to fulfill these demands but on the other hand increases the costs for providing the measurement. Since measurements are mainly a supporting functionality for the service provisioning, measurement costs usually should be limited to a small fraction of the costs of the network service provisioning itself. Therefore a reduction of the measurement result data is crucial to prevent the depletion of the available (i.e. the affordable) resources. Such a reduction can be achieved by a reasonable deployment of packet selection techniques, that sample a subset of the packets while still allowing an appropriate accuracy, or filter out all packets that are not of interest for the measurement at all. Packet selection helps to prevent an exhaustion of resources and to limit the measurement costs. Examples for applications that benefit from packet selection are given in [DuGG03]. 2. Terminology The terms used throughout this document are defined in [DuGG03]. We here just repeat the definitions of the specific selection operations. Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 2] Internet Draft Techniques for IP Packet Selection June 2003 Filtering: a filter is a selection operation that selects a packet deterministically based on the packet content, its treatment, and functions of these occurring in the selection state. Examples include match/mask filtering, and hash-based selection. Sampling: a selection operation that is not a filter is called a sampling operation. Content-independent Sampling: a sampling operation that does not use packet content (or quantities derived from it) as the basis for selection is called a content-independent sampling operation. Examples include systematic sampling, and uniform pseudorandom sampling driven by a pseudorandom number whose generation is independent of packet content. Note that independent sampling a does not need to access the packet content in order to make the selection decision. Content-dependent Sampling: a sampling operation where selection is dependent on packet content is called a content-dependent sampling operation. Examples include pseudorandom selection according to a probability that depends on the contents of a packet field; note that this is not a filter. Emulated Sampling: selection operations in any of the above four categories may be emulated by operations in the same or another category for the purposes of implementation. Hash-based selection: a filter specified by a hash domain, a hash function, and hash range and a hash selection range. Hash domain: a subset of the packet content and the packet treatment, viewed as a N-bit string for some positive integer N. Hash range: a set of M-bit strings for some positive integer M Hash function: a deterministic map from the hash domain into the hash range. Hash selection range: a subset of the hash range. The packet is selected if the action of the hash function on the hash domain for the packet yields a result in the hash selection range. Metering process: see the definition in [QuZC03] Pool size: the size of a set of packets in a packet stream. Sample size: the size of a set of packets selected by a sampling operation. Target Sampling Rate: a configurable sampling rate in a sampling operation. Attained Sampling Rate: Given a subset of packets in a stream input to a sampling operation, the attained sampling rate is the ratio of the sample size to the pool size. Packet Stream: a sequence of packets, each of which was observed at the observation point. Note that when packets are sampled from a stream, the selected packets usually do not have common properties by which they can be distinguished from packets that have not been Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 3] Internet Draft Techniques for IP Packet Selection June 2003 selected. Therefore we define here the term stream instead of flow, which is defined as set of packets with common properties [QuZC02]. Selection Process: a selection process selects a substream of packets from the observed packet stream. A selection process entails the composition of one or more selectors operating in succession. If selectors are composed, the output packet stream of one selector forms the input packet stream for the succeeding selector. 3. Scope and Deployment of Packet Selection Techniques The function of packet selection is to select a subset from the stream of all packets visible at an observation point. Selection can be used to select packets based on their content, and/or to reduce the rate of packet reports regardless of content. The selection technique used to select a subset of packets out of all those crossing an observation point depends on the purpose (application) for which measurement is performed. If the main purpose of an application is to infer some characteristic of the whole set of crossing packets without processing them all (thus reducing the computation load) then we call the used selection technique ôsamplingö. In principle, with sampling the content of the packet is not relevant for the packet selection: what matters is only that the selected sample has a distribution of the characteristic to infer similar to the one of the parent population, so that it can be estimated reliably. The sampling decision may be based on the temporal or spatial position of the packet in the packet stream, or may depend on a (pseudo) random number extraction or calculation. On the contrary, if the application needs to consider all the packets having some common property, then we call the selection technique ôfilteringö. The property can be directly derived by some computation on the packet content, or depend on the treatment given by the router to the packet. We conclude that sampling can depend on packet position or on (pseudo) random decisions, while filtering depends on packet content, but never depends on packet position or on (pseudo) random decisions. Note that a common technique to select packets is to compute a hash function on some bits of the packet header and/or content and to select it if the result falls in a certain selection range. Since hashing is a deterministic operation, it is a powerful mean to ensure that the same packets are selected at multiple measurement points. Depending on the chosen input bits, on the hash function and on the selection range, this technique could also be used to emulate the random selection of packets with a given probability p. Hashing is then a particular type of filtering, but can also be used to emulate random sampling. The introduced classification is mainly useful for the definition of an information model describing ôprimitiveö selection techniques. More complex selection techniques may then be described through the composition of cascaded sampling and filtering operations. For example, a packet selection that weights the selection probability on the basis of the packet length can be described as a set of filter/sampling cascades. However, this descriptive approach is not intended to be rigid: if a common and consolidated selection Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 4] Internet Draft Techniques for IP Packet Selection June 2003 practice turns out to be too complex to be described as a composition of the mentioned building blocks, an ad hoc description can be specified instead. We consider packet selectors as part of an IPFIX metering process which also can use the IPFIX exporting process. This is expressed as association to one or more IPFIX processes. 3.1 Sampling The deployment of sampling techniques aims at the provisioning of information about a specific characteristic of the parent population at a lower cost than a full census would demand. In order to plan a suitable sampling strategy it is therefore crucial to determine the needed type of information and the desired degree of accuracy in advance. First of all it is important to know the type of metric that should be estimated. The metric of interest can range from simple packet counts [JePP92] up to the estimation of whole distributions of flow characteristics (e.g. packet sizes)[ClPB93]. Secondly, the required accuracy of the information and with this, the confidence that is aimed at, should be known in advance. For instance for usage-based accounting the required confidence for the estimation of packet counters can depend on the monetary value that corresponds to the transfer of one packet. That means that a higher confidence could be required for expensive packet flows (e.g. premium IP service) than for cheaper flows (e.g. best effort). The accuracy requirements for validating a previously agreed quality can also vary extremely with the customer demands. These requirements are usually determined by the service level agreement (SLA). Sampling is considered as part of the metering process. A metering process consists of multiple functions (capturing, time stamping, etc.). Sampling can be applied at different functions of the metering process. In the following we consider a measured IP packet with its observation point and timestamp as basis elements of the parent population. The sampling method and the parameters in use must be clearly communicated to all applications that use the measurement data. Only with this knowledge a correct interpretation of the measurement results can be ensured. Sampling Methods can be characterized by the sampling algorithm, the trigger type used for starting a sampling interval and the length of the sampling interval. These parameters are described here in detail. The sampling algorithm describes the basic process for selection of samples. In accordance to [AmCa89] and [ClPB93] we define the following basic sampling processes: 3.1.1 Systematic Sampling Systematic sampling describes the process of selecting the starting points and the duration of the selection intervals according to a deterministic function. This can be for instance the periodic selection of every n-th element of a trace but also the selection of all packets that arrive at pre-defined points in time. Even if the selection process does not follow a periodic function (e.g. if the Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 5] Internet Draft Techniques for IP Packet Selection June 2003 time between the sampling intervals varies over time) we consider this as systematic sampling as long as the selection is deterministic. The use of systematic sampling always involves the risk of biasing the results. If the systematics in the sampling process resembles systematics in the observed stochastic process (occurrence of the characteristic of interest in the network), there is a high probability that the estimation will be biased. Systematics (e.g. periodic repetition of an event) in the observed process might not be known in advance. Here only equally spaced schemes are considered, where triggers for sampling are periodic, either in time or in packet count. All packets occurring in a selection interval (either in time or packet count) beyond the trigger are selected. The case that the selection interval covers only the first available packet for count-based sampling is often called 1 in N sampling: packets are selected with count period N. More generally, some number M With STREAM ID: Observation point ID | List of SELECTOR_IDs 5.2 Information Model for Filtering Techniques In this section we define the information models for most common filtering techniques. The information model structure closely parallels the one presented for the sampling techniques. Filter Description: SELECTOR_ID SELECTOR_TYPE SELECTOR_PARAMETERS OPERATING_TIME ASSOCIATIONS Where: SELECTOR_ID: Unique ID for the packet filter. The ID can be calculated under consideration of the ASSOCIATIONS and a local ID. SELECTOR_TYPE Description: For filtering processes the SELECTOR TYPE defines what filtering type is used. Values: Matching | Hashing | Router_state SELECTOR_PARAMETERS Description: For filtering processes the SELECTOR PARAMETERS define formally the common property of the packet being filtered. For the filters of type Matching and Hashing the definitions have a lot of points in common. Values: Case Matching -
- - -
- - - - - Notes to Case Matching: - The filter can be defined for the header part only, for the payload part only or for both. In the latter case the matching must be an AND of the two. - The bit specification, for the header part, can be specified for ipv4 or ipv6 only, or both - In case of ipv4, the bit specification is a sequence of 20 Hexadecimal numbers [00,FF] specifying a 20 bytes bitmask to be applied to the header Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 13] Internet Draft Techniques for IP Packet Selection June 2003 - In case of ipv6, it is a sequence of 40 Hexadecimal numbers [00,FF] specifying a 40 bytes bitmask to be applied to the header - The bit specification, for the payload part, is a sequence of Hexadecimal numbers [00,FF] specifying the bitmask to be applied to the first N bytes of the payload, as specified by the previous field. In case the Hexadecimal number sequence is longer then N, only the first N numbers are considered. - In case the payload is shorter than N, the packet will not match the filter Other options, like padding with zeros, may be considered in the future. - The selection interval specification is a list of non overlapping intervals [intv_begin, intv_end] where intv_begin, intv_end are bit strings of length 20*8 (ipv4 case), 40*8 (ipv6 case), N*8 (payload case). - A filter cannot be defined on the options field of the ipv4 header, neither on stacked headers of ipv6. - This specification doesnÆt preclude the future definition of a high level syntax for defining in a concise way bit selection and matching rules in a more human readable form (e.g. ôTCP port in [2000,3000]ö). The requirement is that such a syntax can be univoquely compiled into the one defined above Case Hashing: -
- -
- - - - - Hashing function specification (includes length of hash function output M) - Selection interval specification, as a list of non overlapping intervals [start value, end value] where value is in [0,2^M-1] Notes to Case Hashing: - On Input bit specifications fields, the same notes on bit specifications of the Matching case reported above apply Case Router State: - Ingress interface at which packet arrives equals a specified value - Egress interface to which packet is routed to equals a specified value - Packet violated acl on the router - Failed rpf - Failed rsvp - No route found for the packet - Origin AS equals a specified value or lies within a given range - Destination AS equals a specified value or lies within a given range Note to Case Router State: - All Router state entries can be linked by AND, OR, NOT operators Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 14] Internet Draft Techniques for IP Packet Selection June 2003 OPERATING_TIME Description: The OPERATING_TIME parameter describes the start/stop time of filtering process. List elements must not overlap. The start time of the first element can be omitted, meaning ôfrom nowö. The end time of the last element can be omitted, meaning ôuntil sampler is removedö. Values: List of (Start time, End time) ASSOCIATIONS Description: The ASSOCIATIONS field describes the observation point and the IPFIX processes to which the packet selector is associated. The STREAM ID denotes the origin of the data stream that is input to the selection function. It can be the observation point directly or the ID of another selector. With this it is possible to define combined schemes. If the STREAM ID contains IDs from other selectors, one can derive the original observation point from the selector definitions of these specified selectors. Values: STREAM ID, Metering process ID, Exporting process ID> With STREAM ID: Observation point ID | List of SELECTOR_IDs 6. Composite Techniques Composite schemes are realized by using the STREAM ID in the information models. The STREAM ID denotes from which selectors the input stream originates. If multiple stream IDs are given, this means that the selector operates on the packet stream simply resulting from the time superposition of the output of all the listed filters and samplers. Note that a sampler/filter could be intermittently active, as defined in the OPERATING TIME field. Some examples of composite schemes are reported below. 6.1 Cascaded filtering->sampling or sampling->filtering If a filter precedes a sampling process the role of filtering is to create a set of ôparent populationsö from a single stream that can then be fed independently to different sampling functions, with different parameters tuned for the population itself (e.g. if streams of different intensity result from filtering, it may be good to have different sampling rates). If filtering follows a sampling process, the same sampling rate and type is applied to the whole stream, independently of the relative size of the streams resulting from the filtering function. Moreover, also packets not destined to be selected will ôloadö the sampling function. So, in principle, filtering before sampling allows a more accurate tuning of the sampling procedure, but if filters are too complex to work at full line rate (e.g. because they have to access router state information), sampling before filtering may be a need. 6.2 Stratified Sampling Stratified sampling is one example for using a composite technique. The basic idea behind stratified sampling is to increase the estimation accuracy by using a-priori information. The a-priori information is used to perform an intelligent grouping of the elements of the parent population. With this a higher estimation accuracy can be achieved with the same sample size. Stratified sampling divides the sampling process into multiple steps. First, the elements of the parent population are grouped into subsets Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 15] Internet Draft Techniques for IP Packet Selection June 2003 in accordance to a given characteristic. This grouping can be done in multiple steps. Then samples are taken from each subset. The stronger the correlation between the characteristic used to divide the parent population and the characteristic of interest (for which an estimate is sought after), the easier is the consecutive sampling process and the higher is the stratification gain. For instance if the dividing characteristic were equal to the investigated characteristic, each element of the sub-group would be a perfect representative of that characteristic. In this case it would be sufficient to take one arbitrary element out of each subgroup to get the actual distribution of the characteristic in the parent population. Therefore stratified sampling can reduce the costs for the sampling process (i.e. the number of samples needed to achieve a given level of confidence). For stratified sampling one has to specify classification rules for grouping the elements into subgroups and the sampling scheme that is used within the subgroups. The classification rules can be expressed by multiple filters. For the sampling scheme within the subgroups the parameters have to be specified as described above. 7. Security Considerations Security threats can occur if the configuration of sampling parameters or the communication of sampling parameters to the application is corrupted. This document only describes sampling schemes that can be used for packet selection. It neither describes a mechanism how those parameters are configured nor how these parameters are communicated to the application. Therefore the security threats that originate from this kind of communication cannot be assessed with the information given in this document. In some cases malicious users or attackers may be interested to hide packets from the service provider. For instance if packet selectors are used for accounting or intrusion detection applications, users may want to prevent that packets are selected. If a deterministic sampling scheme is used or a selection scheme that takes packet content into account, the user can shape or send packets in a way that they are less likely to be selected. This has to be taken into account when choosing an appropriate packet selection technique. 8. References [AmCa89] Paul D. Amer, Lillian N. Cassel: Management of Sampled Real-Time Network Measurements, 14th Conference on Local Computer Networks, October 1989, Minneapolis, pages 62- 68, IEEE, 1989 [ClPB93] K.C. Claffy, George C Polyzos, Hans-Werner Braun: Application of Sampling Methodologies to Network Traffic Characterization, Proceedings of ACM SIGCOMM'93, San Francisco, CA, USA, September 13 - 17, 1993 [CoGi98] I. Cozzani, S. Giordano: Traffic Sampling Methods for end-to-end QoS Evaluation in Large Heterogeneous Networks. Computer Networks and ISDN Systems, 30 (16- 18), September 1998. Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 16] Internet Draft Techniques for IP Packet Selection June 2003 [DuGG03] Nick Duffield, Albert Greenberg, Matthias Grossglauser, Jennifer Rexford: A Framework for Passive Packet Measurement, Internet Draft draft-ietf-psamp-framework- 01, work in progress, June 2003 [DuGr00] Nick Duffield, Matthias Grossglauser: Trajectory Sampling for Direct Traffic Observation, Proceedings of ACM SIGCOMM 2000, Stockholm, Sweden, August 28 - September 1, 2000. [EsVa01] C. Estan and G. Varghese, ôNew Directions in Traffic Measurement and Accountingö, ACM SIGCOMM Internet Measurement Workshop 2001, San Francisco (CA) Nov. 2001 [HT52] D.G. Horvitz and D.J. Thompson, A Generalization of Sampling without replacement from a Finite Universe. J. Amer. Statist. Assoc. Vol. 47, pp. 663-685, 1952. [JePP92] Jonathan Jedwab, Peter Phaal, Bob Pinna: Traffic Estimation for the Largest Sources on a Network, Using Packet Sampling with Limited Storage, HP technical report, Managemenr, Mathematics and Security Department, HP Laboratories, Bristol, March 1992, http://www.hpl.hp.com/techreports/92/HPL-92-35.html [Knuth98] Donald E. Knuth: The Art of Computer Programming, Volume 3: Searching and Sorting, Addison Wesley, 1998 [QuZC03] J. Quittek, T. Zseby, B. Claise, S. Zander: Requirements for IP Flow Information Export, Internet Draft , work in progress, June 2002 [Zseb02] Tanja Zseby: Deployment of Sampling Methods for SLA Validation with Non-Intrusive Measurements, Proceedings of Passive and Active Measurement Workshop (PAM 2002), Fort Collins, CO, USA, March 25-26, 2002 9. Author's Addresses Tanja Zseby Fraunhofer Institute for Open Communication Systems Kaiserin-Augusta-Allee 31 10589 Berlin Germany Phone: +49-30-34 63 7153 Fax: +49-30-34 53 8153 Email: zseby@fokus.fhg.de Maurizio Molina NEC Europe Ltd., Network Laboratories Adenauerplatz 6 69115 Heidelberg Germany Phone: +49 6221 90511-18 Email: molina@ccrle.nec.de Fredric Raspall NEC Europe Ltd., Network Laboratories Adenauerplatz 6 69115 Heidelberg Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 17] Internet Draft Techniques for IP Packet Selection June 2003 Germany Phone: +49 6221 90511-31 EMail: raspall@ccrle.nec.de Nick Duffield AT&T Labs - Research Room B-139 180 Park Ave Florham Park NJ 07932, USA Phone: +1 973-360-8726 Email: duffield@research.att.com 10. Intellectual Property Statement AT&T Corporation may own intellectual property applicable to this contribution. The IETF has been notified of AT&T's licensing intent for the specification contained in this document. See http://www.ietf.org/ietf/IPR/ATT-GENERAL.txt for AT&T's IPR statement. 11. Full Copyright Statement Copyright (C) The Internet Society (2002). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Zseby, Molina, Raspall, Duffield Expires December 2003 [Page 18]