Internet Draft Document: T. Zseby Expires: April 2003 Fraunhofer FOKUS M. Molina NEC Europe Ltd. F. Raspall NEC Europe Ltd. Nick Duffield AT&T Labs - Research October2003 Sampling and Filtering Techniques for IP Packet Selection Status of this Memo This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. Abstract This document describes sampling and filtering techniques for IP packet selection. In two information models (one for sampling, one for filtering) it defines what parameters are needed to describe the most common selection schemes and shows how techniques can be combined to build more elaborate packet selectors. The information models are used for configuring the selection technique in measurement processes and for reporting the technique in use to a collector. Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 1] Internet Draft Techniques for IP Packet Selection October 2003 Table of Contents 1. Introduction.................................................2 2. Terminology..................................................3 3. Scope and Deployment of Packet Selection Techniques..........8 3.1 Sampling.....................................................9 3.1.1 Systematic Sampling........................................10 3.1.2 Random Sampling............................................10 3.1.2.1 n-out-of-N sampling......................................11 3.1.2.2 Probabilistic sampling...................................11 3.1.2.2.1 Uniform Probabilistic Sampling.........................11 3.1.2.2.2 Non-Uniform Probabilistic Sampling.....................11 3.1.2.2.3 Non-Uniform flow State dependent sampling..............12 3.1.3 Overview of sampling schemes regarding content dependence..12 4. Filtering...................................................13 4.1 Mask/Match filtering........................................14 4.2 Hashing filtering...........................................14 4.2.1 Examples of Hashing Filtering applications.................15 4.2.1.1 Random sampling emulation................................15 4.2.1.2 Consistent packet selection and its applications.........15 4.2.2 Guarding Against Pitfalls and Vulnerabilities..............16 4.3 Router State filtering......................................17 5. Input Parameters and Information Models.....................17 5.1 Information Model for Sampling Techniques...................18 5.2 Information Model for Filtering Techniques..................20 6. Composite Techniques........................................23 6.1 Cascaded filtering->sampling or sampling->filtering.........23 6.2 Stratified Sampling.........................................23 7. Security Considerations.....................................24 8. References..................................................25 9. Author's Addresses..........................................26 10. Intellectual Property Statement.............................27 11. Full Copyright Statement....................................27 1. Introduction Increasing data rates and growing measurement demands increase the requirements for data collection resources. For measurement scenarios in backbone networks it is often required to measure whole traffic aggregates instead of single flows. Furthermore, some measurement methods require the capturing of packet headers or even parts of the payload. All this can lead to an overwhelming amount of measurement data, resulting in high demands regarding resources for metering, storage, transport and post processing. In some cases specialized hardware helps to fulfill these demands but on the other hand increases the costs for providing Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 2] Internet Draft Techniques for IP Packet Selection October 2003 the measurement. Since measurements are mainly a supporting functionality for the service provisioning, measurement costs usually should be limited to a small fraction of the costs of the network service provisioning itself. A reduction of the data that is considered and reported by a measurement process is crucial to prevent the depletion of the available (i.e. the affordable) resources. Such a reduction can be achieved by a reasonable deployment of packet selection techniques, that sample a subset of the packets while still allowing an appropriate accuracy, or filter out all packets that are not of interest for the measurement at all. Packet selection helps to prevent an exhaustion of resources and to limit the measurement costs. Examples for applications that benefit from packet selection are given in [DuGG03]. 2. Terminology The terms used throughout this document are consistent with those defined in [DuGG03]. 2.1 Terminology for Observation and Selection * Observation Point: a location in the network where a packet stream is observed. Examples include: - a line to which a probe is attached; - a shared medium, such as an Ethernet-based LAN; - a single port of a router, or set of interfaces (physical or logical) of a router; - an embedded measurement subsystem within an interface. * Measurement Process: the combination of a selection process followed by a reporting process. * Observed Packet Stream: the set of all packets observed at the observation point. * Packet Stream: either the observed packet stream, or a subset of it that forms the output of a selection process. Note that packets selected from a stream, e.g. by sampling, do not necessarily possess a property by which they can be distinguished from packets that have not been selected. For this reason the term ‘‘stream’’ is favored over ‘‘flow’’, which is defined as set of packets with common properties [QuZC02]. Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 3] Internet Draft Techniques for IP Packet Selection October 2003 * Selection Process: takes a packet stream as its input and selects a subset of that stream as its output. * Selector: defines the action of a selection process on a single packet of its input. A selected packet becomes an element of the output packet stream of the selection process. The selector can make use of the following information in determining whether a packet is selected: - the packet’s content; - information derived from the packet's treatment at the observation point; - any selection state that may be maintained by the selection process. * Packet Content: the union of the packet header (which includes link layer, network layer and other encapsulation headers) and the packet payload. * Selection State: a selection process may maintain state information for use by the selection process and/or the reporting process. At a given time, the selection state may depend on packets observed at and before that time, and other variables. Examples include: - sequence numbers of packets at the input of selectors; - a timestamp of arrival of the packet at the observation points; - iterators for pseudorandom number generators; - hash values calculated during selection; - indicators of whether the packet was selected by a given selector. Selectors may change the selection state. * Composite Selection Processes: selection processes may be composed, in which case the output stream issuing from one of the selection processes in the composition forms the input stream for the succeeding selection process. The composition of selection processes is also a selection process; the elements of the composition can be viewed as Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 4] Internet Draft Techniques for IP Packet Selection October 2003 selection subprocesses of the composite. A selection process that is not a subprocess of another selection process takes the observed packet stream as its input. * Primitive Selection Process: a selection process that is not a composition of other selection processes. * Composite Selector: the selector of a composite selection process. * Primitive Selector: the selector of a primitive selection process. 2.2 Terminology for Reporting, Export and Collection * Reporting Process: creates a report stream on packets selected by a selection process, in preparation for export. The input to the reporting process comprises that information available to the selection process per selected packet, specifically: - the selected packet’s content; - information derived from the selected packet's treatment at the observation point; - any selection state maintained by the inputting selection process, reflecting any modifications to the selection state made during selection of the packet. * Report Stream: the output of a reporting process is a report stream, comprising two distinguished types of information: packet reports, and report interpretation. * Packet Reports: a configurable subset of the per packet input to the reporting process. * Report Interpretation: subsidiary information relating to one or more packets, that is used for interpretation of their packet reports. Examples include configuration parameters of the measurement process and of the export process. * Export Process: sends the output of one or more reporting processes to one or more collectors. * Collector: a collector receives a report stream exported by one or more export processes. In some cases, the host of the measurement and/or export processes may also serve as the collector. Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 5] Internet Draft Techniques for IP Packet Selection October 2003 * Export packets: one or packet reports, and perhaps report interpretation, are bundled by the export process into a export packet for export to a collector. Various possibilities for the high level architecture of these elements is as follows. MP = Measurement Process, EP = Export Process +---------------------+ +------------------+ |Observation Point(s) | | Collector(1) | |MP(s)--->EP----------+---------------->| | |MP(s)--->EP----------+-------+-------->| | +---------------------+ | +------------------+ | +---------------------+ | +------------------+ |Observation Point(s) | +-------->| Collector(2) | |MP(s)--->EP----------+---------------->| | +---------------------+ +------------------+ +---------------------+ |Observation Point(s) | |MP(s)--->EP---+ | | | | |Collector(3)<-+ | +---------------------+ 2.3 Packet Selection Terminology. * Filtering: a filter is a selection operation that selects a packet deterministically based on the packet content, its treatment, and functions of these occurring in the selection state. Examples include mask/match filtering, and hash-based selection. * Sampling: a selection operation that is not a filter is called a sampling operation. This reflects the intuitive notion that if the selection of a packet cannot be exactly determined from its content alone, there must be some type of sampling taking place. * Content-independent Sampling: a sampling operation that does not use packet content (or quantities derived from it) as the basis for selection is called a content-independent sampling operation. Examples include systematic sampling, and uniform pseudorandom sampling driven by a pseudorandom number whose generation is independent of packet content. Note that in Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 6] Internet Draft Techniques for IP Packet Selection October 2003 independent sampling it is not necessary to access the packet content in order to make the selection decision. * Content-dependent Sampling: a sampling operation where selection is dependent on packet content is called a content- dependent sampling operation. Examples include pseudorandom selection according to a probability that depends on the contents of a packet field; note that this is not a filter. * Approximate Selection: selection operations in any of the above four categories may be approximated by operations in the same or another category for the purposes of implementation. For example, uniform pseudorandom sampling may be approximated by hash-based selection, using a suitable hash function and hash domain. In this case, the closeness of the approximation depends on the choice of hash function and hash domain. * Hash-based Selection: a filter specified by a hash domain, a hash function, and hash range and a hash selection range. * Hash Domain: a subset of the packet content and the packet treatment, viewed as an N-bit string for some positive integer N. * Hash Range: a set of M-bit strings for some positive integer M. * Hash Function: a deterministic map from the hash domain into the hash range. * Hash Selection Range: a subset of the hash range. The packet is selected if the action of the hash function on the hash domain for the packet yields a result in the hash selection range. * Population size: the number of packets in a subset of a packet stream. * Sample size: the number of packets selected from a subset of a packet stream by a selection operation. * Attained Selection Frequency: the actual proportion of packets selected by a selection operation from a given subset of a packet stream. Calculated as the ratio of the sample size to the population size for the subset. When the selection operation is a sampling operation, or approximates a sampling operation, we can speak of the Attained Sampling Frequency. Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 7] Internet Draft Techniques for IP Packet Selection October 2003 * Target Sampling Frequency: a configurable sampling frequency in a sampling operation. Due to the inherent statistical variability of sampling, the target and attained sampling rates will not in general be equal, although they may be close in some circumstances, e.g., when the population size is large. 3. Scope and Deployment of Packet Selection Techniques Packet selection techniques generate a subset of packets from an Observed Packet Stream at an observation point. We distinguish between sampling and filtering. Sampling is targeted at the selection of a representative subset of packets. The subset is used to infer knowledge about the whole set of observed packets without processing them all. . The selection can depend on packet position, and/or on packet content, and/or on (pseudo) random decisions. Filtering selects a subset with common properties. This is used if only a subset of packets is of interest. The properties can be directly derived from the packet content, or depend on the treatment given by the router to the packet. Filtering is a deterministic operation. It depends on packet content or router treatment. It never depends on packet position or on (pseudo) random decisions. Note that a common technique to select packets is to compute a Hash Function on some bits of the packet header and/or content and to select it if the Hash Value falls in the Hash Selection Range. Since hashing is a deterministic operation on the packet content, it is a filtering technique according to our categorization. Nevertheless, hash functions are sometimes used to emulate random sampling. Depending on the chosen input bits, the Hash Function and the Hash Selection Range, this technique can be used to emulate the random selection of packets with a given probability p. It is also a powerful technique to consistently select the same packet subset at multiple observation points [DuGr00] The introduced classification is mainly useful for the definition of an information model describing Primitive Selectors . More complex selection techniques may then be described through the composition of cascaded sampling and filtering operations. For example, a packet selection that weights the selection probability on the basis of the packet length can be described Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 8] Internet Draft Techniques for IP Packet Selection October 2003 as a set of filter/sampling cascades. However, this descriptive approach is not intended to be rigid: if a common and consolidated selection practice turns out to be too complex to be described as a composition of the mentioned building blocks, an ad hoc description can be specified instead. Packet selectors can be part of an IPFIX metering process which also can use the IPFIX exporting process. This is expressed as association to one or more IPFIX processes. 3.1 Sampling The deployment of sampling techniques aims at the provisioning of information about a specific characteristic of the parent population at a lower cost than a full census would demand. In order to plan a suitable sampling strategy it is therefore crucial to determine the needed type of information and the desired degree of accuracy in advance. First of all it is important to know the type of metric that should be estimated. The metric of interest can range from simple packet counts [JePP92] up to the estimation of whole distributions of flow characteristics (e.g. packet sizes)[ClPB93]. Secondly, the required accuracy of the information and with this, the confidence that is aimed at, should be known in advance. For instance for usage-based accounting the required confidence for the estimation of packet counters can depend on the monetary value that corresponds to the transfer of one packet. That means that a higher confidence could be required for expensive packet flows (e.g. premium IP service) than for cheaper flows (e.g. best effort). The accuracy requirements for validating a previously agreed quality can also vary extremely with the customer demands. These requirements are usually determined by the service level agreement (SLA). Sampling is considered as part of the metering process. A metering process consists of multiple functions (capturing, time stamping, etc.). Sampling can be applied at different functions of the metering process. In the following we consider a measured IP packet with its observation point and timestamp as basis elements of the parent population. The sampling method and the parameters in use must be clearly communicated to all applications that use the measurement data. Only with this knowledge a correct interpretation of the measurement results can be ensured. Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 9] Internet Draft Techniques for IP Packet Selection October 2003 Sampling methods can be characterized by the sampling algorithm, the trigger type used for starting a sampling interval and the length of the sampling interval. These parameters are described here in detail. The sampling algorithm describes the basic process for selection of samples. In accordance to [AmCa89] and [ClPB93] we define the following basic sampling processes: 3.1.1 Systematic Sampling Systematic sampling describes the process of selecting the start points and the duration of the selection intervals according to a deterministic function. This can be for instance the periodic selection of every k-th element of a trace but also the selection of all packets that arrive at pre-defined points in time. Even if the selection process does not follow a periodic function (e.g. if the time between the sampling intervals varies over time) we consider this as systematic sampling as long as the selection is deterministic. The use of systematic sampling always involves the risk of biasing the results. If the systematics in the sampling process resemble systematics in the observed stochastic process (occurrence of the characteristic of interest in the network), there is a high probability that the estimation will be biased. Systematics in the observed process might not be known in advance. Here only equally spaced schemes are considered, where triggers for sampling are periodic, either in time or in packet count. All packets occurring in a selection interval (either in time or packet count) beyond the trigger are selected. Systematic count-based In systematic count-based sampling the start and stop triggers for the sampling interval are defined in accordance to the special packet position (packet count). Systematic time-based In systematic count-based sampling the start and stop triggers for the sampling interval are defined in accordance to the temporal packet position (arrival time). Both schemes are content-independent selection schemes. 3.1.2 Random Sampling Random sampling selects the starting points of the sampling intervals in accordance to a random process. The selection of elements are independent experiments. With this, unbiased estimations can be achieved. In contrast to systematic sampling, Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 10] Internet Draft Techniques for IP Packet Selection October 2003 random sampling requires the generation of random numbers. One can differentiate two methods of random sampling: 3.1.2.1 n-out-of-N sampling In n-out-of-N sampling n elements are selected out of the parent population that consists of N elements. One example would be to generate n different random numbers in the range [1,N] and select all packets which have a packet position equal to one of the random numbers. For this kind of sampling the sample size n is fixed. 3.1.2.2 Probabilistic sampling In probabilistic sampling the decision whether an element is selected or not is made in accordance to a pre-defined selection probability. An example would be to flip a coin for each packet and select all packets for which the coin showed the head. For this kind of sampling the sample size can vary for different trials. The selection probability does not necessarily has to be the same for each packet. Therefore we distinguish between uniform probabilistic sampling (with the same selection probability for all packets) and non-uniform probabilistic sampling (where the selection probability can vary for different packets). 3.1.2.2.1 Uniform Probabilistic Sampling For Uniform Probabilistic Sampling packets are selected independently with some uniform probability p. This sampling can be count-driven, and is sometimes referred to as geometric random sampling, since the difference in count between successive selected packets are independent random variables with a geometric distribution of mean 1/p. A time-driven analog, exponential random sampling, has the time between triggers exponentially distributed. Both geometric and exponential random sampling are examples of what is known as additive random sampling, defined as sampling where the intervals or counts between successive samples are independent identically distributed random variable. 3.1.2.2.2 Non-Uniform Probabilistic Sampling This is a variant of Uniform Probabilistic Sampling in which the sampling probabilities can depend on the selection process input. This can be used to weight sampling probabilities in order e.g. to boost the chance of sampling packets that are rare Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 11] Internet Draft Techniques for IP Packet Selection October 2003 but are deemed important. Unbiased estimators for quantitative statistics are recovered by renormalization of sample values; see [HT52]. 3.1.2.2.3 Non-Uniform flow State dependent sampling Another type of sampling that can be classified as probabilistic Non-Uniform is closely related to the flow concept as defined in [QuZC02], and it is only used jointly with a flow monitoring function (IPFIX monitoring function). Packets are selected, dependent on a selection state. The point, here, is that the selection state is determined also by the state of the flow the packet belongs to and/or by the state of the other flows currently being monitored by the associated flow monitoring function. An example for such an algorithm is the ’’sample and hold’’ method described in [EsVa01]: - if a packet accounts for a flow record that already exists in the IPFIX flow recording process, select the packet (i.e. simply update the flow record) - if a packet doesn't account to any existing flow record, select it with probability p. If selected, create a new flow record for it. A further algorithm that fits into the category of non-uniform flow state dependent sampling is described in [Molina03]. This type of sampling is content dependent because the identification of the flow the packet belongs to requires analyzing part of the packet content. If the packet is selected, then it is passed as an input to the IPFIX monitoring function (this is called ’’Local Export’’ in [DuGG03] Selecting the packet depending on the state of a flow cache is useful when memory resources of the flow monitoring function are scarce (i.e. there’s no room to keep all the flows that have been scheduled for monitoring). See [MolFl03] for a more detailed description of the motivations for this type of sampling and the impact on the IPFIX metering. 3.1.3 Overview of sampling schemes regarding content dependence n-out-of-N sampling and uniform probabilistic sampling are content- -independent selection schemes. For non-uniform probabilistic sampling and non-uniform flow state dependent sampling the sampling probability can be based on packet content. The following table shows what schemes are content- independent and which can depend on the content. Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 12] Internet Draft Techniques for IP Packet Selection October 2003 Sampling Scheme | content- | content- | independent | dependent ------------------------------+---------------+--------------- systematic | X | count-based | | ------------------------------+---------------+--------------- systematic | X | time-based | | ------------------------------+---------------+--------------- random | X | n-out-of-N | | ------------------------------+---------------+--------------- random | X | uniform probabilistic | | ------------------------------+---------------+--------------- random | | X non-uniform probabilistic | | ------------------------------+---------------+--------------- random | | X non-uniform flow-state | | ------------------------------+---------------+--------------- 4. Filtering Filtering is the selection of packets based on the packet content, the treatment of the packet at the observation point, or deterministic functions of these occurring in the selection state. The packet is selected if these quantities fall into a specified range. Packet filtering is done for a wide variety of purposes e.g. for security, SLA enforcing, accounting. Depending on the type of filtering, it can be applied in different parts of the metering process. The role of filtering, as the word itself suggest, is to separate all the packets having a certain property from those not having it. A distinguishing characteristic from sampling is that the property never depends on the packet position in time or in the space, or on a random process. We identify and describe in the following three filtering techniques. The first two (Mask/Match filtering and Hashing filtering) are stateless, and therefore can make their decision based on the analysis of portion of the packet only. The other (router state filtering) requires to access state information after the analysis of part of the packet and is therefore more Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 13] Internet Draft Techniques for IP Packet Selection October 2003 complex: its usage makes sense only in particular circumstances, as described below. 4.1 Mask/Match filtering This type of filtering selects a packet operating as follows: first a combination of packet’s bit positions is selected taking the logical AND of portion of the packet’s bits and a mask, then it’s checked if the result, seen as an hexadecimal number, falls within a selection range. In the simplest case the selection range can be a single value (e.g. filter selects only the packets with a certain specific header field, e.g. a specific source IP address) but more in general the selection range can be a sequence of non-overlapping intervals if high level interfaces are used to specify mask and matches. The selected bits of the packet aren’t necessarily only those of the IP header, but in case both bits of the IP header and the payload are selected, the masks and the selection intervals MUST be specified separately for the header and the payload. An implementation is not constrained to apply exactly all the steps or in this sequence, provided that the result is equivalent to a logical function doing it. Examples of filters of this class are filters that select packets based on the matching of some of the IP header fields with a (possibly masked) specific value, filters that select packets having some IP header fields values falling within a range, filters that do the same as above on some of the transport header fields (that are thus considered as part of the payload), or a combination of all of the above mentioned possibilities. 4.2 Hashing filtering A Hash Function h maps the packet content c, or some portion of it, onto a Hash Range R. The packet is selected if h(c) is an element of S, which is a subset of R called the Hash Selection Range. Thus hash-based sampling is indeed a particular case of filtering: the object is selected if c is in inv(h(S)). But for desirable Hash Functions the inverse image inv(h(S)) will be extremely complex, and hence h would not be expressible as, say, a match/mask filter or a simple combination of these. Hash-based sampling has mainly two types of usage: it offers a way to emulate random sampling by using packet content to generate pseudorandom variates and a way to consistently select subsets of packets that share a common property. In the following subsections we give more details about them. However, both usages require that the Hash Functions has two statistical properties. First, the hash function h must have Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 14] Internet Draft Techniques for IP Packet Selection October 2003 good mixing properties, in the sense that small changes in the input (e.g. the flipping of a single bit) cause large changes in the output (many bits change). Then any local clump of values of c is spread widely over R by h, and so the distribution of h(c) is fairly uniform even if the distribution of c is not. Then the sampling rate is #S/#R, which can be tuned by choice of S. If S and R are sets contiguous integers, h(c), suitably shifted and normalized, can be interpreted as a pseudorandom variate. The second desirable property depends more closely on the statistics of the content c. In applications, the content c comprises a number of distinct fields, c1 ... cm, e.g. source and destination IP Address, IP identification, and TCP/UDP port numbers (if present) for a packet. With a hash function satisfying the first properties above, selection decisions will appear uncorrelated with the contents of any individual field, if the complementary fields are (i) sufficiently variable themselves, and (ii) sufficiently uncorrelated with cj. 4.2.1 Examples of Hashing Filtering applications 4.2.1.1 Random sampling emulation Although pseudorandom number generators with well understood properties have been developed, they may not be the method of choice in setting where computational resources are scarce. A convenient alternative is to use hash functions of packet content as a source of randomness. The hash (suitably renormalized) is a pseudorandom variate in the interval [0,1]. Other schemes may use packet fields in iterators for pseudorandom numbers. The point here, is that the statistical properties of the idealized packet selection law (such as independence of sampling decisions for different packets, or independence on packet content) may not be exactly shared by an implementation, but only approximately so. Although the selection decisions for non-uniform probabilistic sampling (see Section 3.1.2.2.2 above) also depend on the packet content, this form of sampling is distinguished from the use of packet content to generate variates. In the former case, the content only determines the selection probabilities: selection could then proceed e.g by use of a variates obtained by an independent pseudorandom number generator. In the latter case, the content determines the pseudorandom variates rather than the probabilities. 4.2.1.2 Consistent packet selection and its applications Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 15] Internet Draft Techniques for IP Packet Selection October 2003 In Consistent packet selection, all routers in a network hash parts of the packet content using identical hash function and selection range. The domain of the hash is restricted to those parts of the packet that are invariant from hop to hop. Fields such as Time-to-Live, which is decremented per hop, and header CRC, which is recalculated per hop, are thus excluded from the hash domain. Thus, a given packet is selected at all either all points on its path through the network, or at none. The domain of the hash function needs to be wider than just a flow key, if packets are to be selected quasi-randomly within flows (and e.g. include portions of the payload), see [DuGr00]. If a report on each selected packet is exported to a collector, the latter can reconstruct trajectories of the selected packets, provided it can match different reports on the same packet, and distinguish these from reports on different packets. For this purpose, reports may also contain a second distinct hash (identification hash) of the selected packets. Applications of consistent packet selection include (i) estimation of the network path matrix, i.e., the traffic intensities according to network path, broken down by flow; (ii) detection of routing loops, as indicated by self-intersecting trajectories; (iii) passive performance measurement: prematurely terminating trajectories indicate packet loss, and packet One Way Delay can be determined if reports include (synchronized) timestamps; (iv) network attack tracing, of the actual paths taken by attack packets with spoofed source addresses. 4.2.2 Guarding Against Pitfalls and Vulnerabilities A concern is whether some large set of related packets could be sampled at a rate that significantly differs from the expected sampling rate, either (i) through unanticipated behavior in the hash function, or (ii) because the packets had been deliberately crafted to have this property. The first point underlines the importance of using a hash function with good mixing properties. Examples of such are CRC32 and hash functions based on modular arithmetic, see 6.4 in [Knuth98]. The statistical properties of candidate hash functions need to be evaluated, preferably on packet before adoption for hash-based sampling Hash sampling could be overloaded (or evaded) by an attacker if the hash function and the selection rate are both known. A service provider could build a first defense keeping the Hash Selection Range S private. Then, an attacker could not determine whether a crafted packet is selected, but he would still know that a crafted a set of packets all with the same hash is either all selected or all not selected. Moreover, when applications Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 16] Internet Draft Techniques for IP Packet Selection October 2003 (like multi domain trajectory sampling, or One Way Delay estimation across multiple domains) may require multiple administrative entities to agree on a common hash function and selection range, mutual trust between the entities cannot be avoided and just keeping S secret may not be feasible. A stronger defense is to employ a parametrizable hash function and keep the parameter private: in this case, the set of hash values of the packets could not be determined. Examples of parameters are the initial vector in CRC32, and moduli in hashes based on modular arithmetic. 4.3 Router State filtering This class of filters select a packet on the basis of the following conditions), possibly combined with the AND, OR or NOT operators. - Ingress interface at which packet arrives equals a specified value - Egress interface to which packet is routed to equals a specified value - Packet violated Access Control List (ACL) on the router - Failed Reverse Path Forwarding (RPF) - Failed Resource Reservation (RSVP) - No route found for the packet - Origin AS equals a specified value or lies within a given range - Destination AS equals a specified value or lies within a given range Router architectural considerations may preclude some information concerning the packet treatment, e.g. routing state, being available at line rate for selection of packets. However, if selection not based on routing state has reduced down from line rate, subselection based on routing state may be feasible. 5. Input Parameters and Information Models This section gives an overview of different alternative selection schemes and their required parameters. In order to be compliant with PSAMP it is sufficient to implement one of the proposed schemes. The decision whether to select a packet or not is based on a function which is performed when the packet arrives at the selection process. Packet selection schemes differ in the input parameters for the selection process and the functions they Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 17] Internet Draft Techniques for IP Packet Selection October 2003 require to do the packet selection. The following table gives an overview. Scheme | input parameters | functions ---------------+------------------------+------------------- systematic | packet position | packet counter count-based | sampling pattern | ---------------+------------------------+------------------- systematic | arrival time | clock or timer time-based | sampling pattern | ---------------+------------------------+------------------- random | packet position | packet counter, n-out-of-N | sampling pattern | random numbers | (random number list) | ---------------+------------------------+------------------- uniform | sampling | random function probabilistic | probability | ---------------+------------------------+------------------- non-uniform |e.g. packet position | selection function, probabilistic |or packet content(parts)| probability calc. ---------------+------------------------+------------------- non-uniform |e.g. flow state | selection function, flow-state |or packet content(parts)| probability calc. ---------------+------------------------+------------------- mask/match | packet content(parts) | filter function ---------------+------------------------+------------------- hash-based | packet content(parts) | hash function ---------------+------------------------+------------------- router state | router state | router state | | discovery ---------------+------------------------+------------------- 5.1 Information Model for Sampling Techniques In this section we define the information models for most common sampling techniques. Here the selection function is pre-defined and given by the selector ID. Sampler Description: SELECTOR_ID SELECTOR_TYPE SELECTOR_PARAMETERS OPERATING_TIME ASSOCIATIONS Where: SELECTOR_ID: Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 18] Internet Draft Techniques for IP Packet Selection October 2003 Unique ID for the packet sampler. The ID can be calculated under consideration of the ASSOCIATIONS and a local ID. SELECTOR_TYPE Description: For sampling processes the SELECTOR TYPE defines what sampling algorithm is used. Values: Systematic count-based | Systematic time Based | Random n-out-of-N | uniform probabilistic | non-uniform probabilistic | non-uniform flow-state SELECTOR_PARAMETERS Description: For sampling processes the SELECTOR PARAMETERS define the input parameters for the process. Interval length in systematic sampling means, that all packets that arrive in this interval are selected. The spacing parameter defines the spacing in time or number of packets between the end of one sampling interval and the start of the next succeeding interval. Case n out of N: - Population size N, Sample size n Case Systematic Time Based: - Interval length (in usec), Spacing (in usec) Case Systematic Count Based: - Interval length(in packets), Spacing (in packets) Case uniform Probabilistic (with equal probability per packet): - Sampling probability p Case non-uniform probabilistic: - Calculation function for sampling probability p Case flow state: - information reported for flow state can be found in [MolFl03] OPERATING_TIME Description: The OPERATING_TIME parameter describes the start/stop time of sampling process. List elements must not overlap. The start time of the first element can be omitted, meaning ’’from now’’. The end time of the last element can be omitted, meaning ’’until sampler is removed’’. Values: List of (Start time, End time) ASSOCIATIONS Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 19] Internet Draft Techniques for IP Packet Selection October 2003 Description: The ASSOCIATIONS field describes the observation point and (possibly) the IPFIX processes to which the packet selector can be associated. The STREAM ID denotes the origin of the data stream that is input to the selection function. It can be the observation point directly or the ID of another selector. With this it is possible to define combined schemes. If the STREAM ID contains IDs from other selectors, one can derive the original observation point from the selector definitions of these specified selectors. Values: With STREAM ID: Observation point ID AND List of SELECTOR_IDs 5.2 Information Model for Filtering Techniques In this section we define the information models for most common filtering techniques. The information model structure closely parallels the one presented for the sampling techniques. Filter Description: SELECTOR_ID SELECTOR_TYPE SELECTOR_PARAMETERS OPERATING_TIME ASSOCIATIONS Where: SELECTOR_ID: Unique ID for the packet filter. The ID can be calculated under consideration of the ASSOCIATIONS and a local ID. SELECTOR_TYPE Description: For filtering processes the SELECTOR TYPE defines what filtering type is used. Values: Matching | Hashing | Router_state SELECTOR_PARAMETERS Description: For filtering processes the SELECTOR PARAMETERS define formally the common property of the packet being filtered. For the filters of type Matching and Hashing the definitions have a lot of points in common. Values: Case Matching -
- - -
Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 20] Internet Draft Techniques for IP Packet Selection October 2003 - - - - - Notes to Case Matching: - The filter can be defined for the header part only, for the payload part only or for both. In the latter case the matching must be an AND of the two. - The bit specification, for the header part, can be specified for ipv4 or ipv6 only, or both - In case of ipv4, the bit specification is a sequence of 20 Hexadecimal numbers [00,FF] specifying a 20 bytes bitmask to be applied to the header - In case of ipv6, it is a sequence of 40 Hexadecimal numbers [00,FF] specifying a 40 bytes bitmask to be applied to the header - The bit specification, for the payload part, is a sequence of Hexadecimal numbers [00,FF] specifying the bitmask to be applied to the first N bytes of the payload, as specified by the previous field. In case the Hexadecimal number sequence is longer then N, only the first N numbers are considered. - In case the payload is shorter than N, the packet will not match the filter Other options, like padding with zeros, may be considered in the future. - The selection interval specification is a list of non overlapping intervals [intv_begin, intv_end] where intv_begin, intv_end are bit strings of length 20*8 (ipv4 case), 40*8 (ipv6 case), N*8 (payload case). - A filter cannot be defined on the options field of the ipv4 header, neither on stacked headers of ipv6. - This specification doesn’t preclude the future definition of a high level syntax for defining in a concise way bit selection and matching rules in a more human readable form (e.g. ‘‘TCP port in [2000,3000]’’). The requirement is that such a syntax can be univoquely compiled into the one defined above Case Hashing: -
- -
- - - - Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 21] Internet Draft Techniques for IP Packet Selection October 2003 - Hashing function specification - Hash function name - Length of input key (eliminate 0x bytes) - Output value (length M and bitmask) - Selection interval specification, as a list of non overlapping intervals [start value, end value] where value is in [0,2^M-1] - Additional parameters dependent on specific hash function Notes to Case Hashing: - On Input bit specifications fields, the same notes on bit specifications of the Matching case reported above apply - Some hash functions, their detailed functioning and their specific parameter list are described in [NicMoDu]. Case Router State: - Ingress interface at which packet arrives equals a specified value - Egress interface to which packet is routed to equals a specified value - Packet violated Access Control List (ACL) on the router - Failed Reverse Path Forwarding (RPF) - Failed Resource Reservation (RSVP) - No route found for the packet - Origin AS equals a specified value or lies within a given range - Destination AS equals a specified value or lies within a given range Note to Case Router State: - All Router state entries can be linked by AND, OR, NOT operators OPERATING_TIME Description: The OPERATING_TIME parameter describes the start/stop time of filtering process. List elements must not overlap. The start time of the first element can be omitted, meaning ‘‘from now’’. The end time of the last element can be omitted, meaning ‘‘until sampler is removed’’. Values: List of (Start time, End time) ASSOCIATIONS Description: The ASSOCIATIONS field describes the observation point and (possibly) the IPFIX processes to which the packet selector can be associated. The STREAM ID denotes the origin of the data stream that is input to the selection function. It can Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 22] Internet Draft Techniques for IP Packet Selection October 2003 be the observation point directly or the ID of another selector. With this it is possible to define combined schemes. If the STREAM ID contains IDs from other selectors, one can derive the original observation point from the selector definitions of these specified selectors. Values: STREAM ID, Metering process ID, Exporting process ID> With STREAM ID: Observation point ID AND List of SELECTOR_IDs 6. Composite Techniques Composite schemes are realized by using the STREAM ID in the information models. The STREAM ID denotes from which selectors the input stream originates. If multiple stream IDs are given, this means that the selector operates on the packet stream simply resulting from the time superposition of the output of all the listed filters and samplers. Note that a sampler/filter could be intermittently active, as defined in the OPERATING TIME field. Some examples of composite schemes are reported below. 6.1 Cascaded filtering->sampling or sampling->filtering If a filter precedes a sampling process the role of filtering is to create a set of ‘‘parent populations’’ from a single stream that can then be fed independently to different sampling functions, with different parameters tuned for the population itself (e.g. if streams of different intensity result from filtering, it may be good to have different sampling rates). If filtering follows a sampling process, the same sampling rate and type is applied to the whole stream, independently of the relative size of the streams resulting from the filtering function. Moreover, also packets not destined to be selected will ‘‘load’’ the sampling function. So, in principle, filtering before sampling allows a more accurate tuning of the sampling procedure, but if filters are too complex to work at full line rate (e.g. because they have to access router state information), sampling before filtering may be a need. 6.2 Stratified Sampling Stratified sampling is one example for using a composite technique. The basic idea behind stratified sampling is to increase the estimation accuracy by using a-priori information. The a-priori information is used to perform an intelligent grouping of the elements of the parent population. With this a higher estimation accuracy can be achieved with the same sample size. Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 23] Internet Draft Techniques for IP Packet Selection October 2003 Stratified sampling divides the sampling process into multiple steps. First, the elements of the parent population are grouped into subsets in accordance to a given characteristic. This grouping can be done in multiple steps. Then samples are taken from each subset. The stronger the correlation between the characteristic used to divide the parent population and the characteristic of interest (for which an estimate is sought after), the easier is the consecutive sampling process and the higher is the stratification gain. For instance if the dividing characteristic were equal to the investigated characteristic, each element of the sub-group would be a perfect representative of that characteristic. In this case it would be sufficient to take one arbitrary element out of each subgroup to get the actual distribution of the characteristic in the parent population. Therefore stratified sampling can reduce the costs for the sampling process (i.e. the number of samples needed to achieve a given level of confidence). For stratified sampling one has to specify classification rules for grouping the elements into subgroups and the sampling scheme that is used within the subgroups. The classification rules can be expressed by multiple filters. For the sampling scheme within the subgroups the parameters have to be specified as described above. 7. Security Considerations Security threats can occur if the configuration of sampling parameters or the communication of sampling parameters to the application is corrupted. This document only describes sampling schemes that can be used for packet selection. It neither describes a mechanism how those parameters are configured nor how these parameters are communicated to the application. Therefore the security threats that originate from this kind of communication cannot be assessed with the information given in this document. In some cases malicious users or attackers may be interested to hide packets from the service provider. For instance if packet selectors are used for accounting or intrusion detection applications, users may want to prevent that packets are selected. If a deterministic sampling scheme is used or a selection scheme that takes packet content into account, the user can shape or send packets in a way that they are less likely to be selected. This has to be taken into account when choosing an appropriate packet selection technique. Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 24] Internet Draft Techniques for IP Packet Selection October 2003 8. References [AmCa89] Paul D. Amer, Lillian N. Cassel: Management of Sampled Real-Time Network Measurements, 14th Conference on Local Computer Networks, October 1989, Minneapolis, pages 62-68, IEEE, 1989 [ClPB93] K.C. Claffy, George C Polyzos, Hans-Werner Braun: Application of Sampling Methodologies to Network Traffic Characterization, Proceedings of ACM SIGCOMM'93, San Francisco, CA, USA, September 13 - 17, 1993 [CoGi98] I. Cozzani, S. Giordano: Traffic Sampling Methods for end-to-end QoS Evaluation in Large Heterogeneous Networks. Computer Networks and ISDN Systems, 30 (16-18), September 1998. [DuGG03] Nick Duffield, Albert Greenberg, Matthias Grossglauser, Jennifer Rexford: A Framework for Passive Packet Measurement, Internet Draft draft- ietf-psamp-framework-01, work in progress, June 2003 [DuGr00] Nick Duffield, Matthias Grossglauser: Trajectory Sampling for Direct Traffic Observation, Proceedings of ACM SIGCOMM 2000, Stockholm, Sweden, August 28 - September 1, 2000. [EsVa01] C. Estan and G. Varghese, ‘‘New Directions in Traffic Measurement and Accounting’’, ACM SIGCOMM Internet Measurement Workshop 2001, San Francisco (CA) Nov. 2001 [HT52] D.G. Horvitz and D.J. Thompson, A Generalization of Sampling without replacement from a Finite Universe. J. Amer. Statist. Assoc. Vol. 47, pp. 663-685, 1952. [JePP92] Jonathan Jedwab, Peter Phaal, Bob Pinna: Traffic Estimation for the Largest Sources on a Network, Using Packet Sampling with Limited Storage, HP technical report, Managemenr, Mathematics and Security Department, HP Laboratories, Bristol, March 1992, http://www.hpl.hp.com/techreports/92/HPL-92-35.html Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 25] Internet Draft Techniques for IP Packet Selection October 2003 [Knuth98] Donald E. Knuth: The Art of Computer Programming, Volume 3: Searching and Sorting, Addison Wesley, 1998 [MolFl03] M.Molina: Flow selection support in IPFIX, Internet Draft , work in progress, October 2003. [Molina03] M.Molina: A scalable and efficient methodology for flow monitoring in the internet, International Teletraffic Congress (ITC-18), Berlin, Sep. 2003 [NicMoDu] S. Niccolini, M.Molina, N.Duffield: Hash functions description for packet selection, Internet Draft , work in progress, October 2003. [QuZC03] J. Quittek, T. Zseby, B. Claise, S. Zander: Requirements for IP Flow Information Export, Internet Draft , work in progress, June 2002 [Zseb02] Tanja Zseby: Deployment of Sampling Methods for SLA Validation with Non-Intrusive Measurements, Proceedings of Passive and Active Measurement Workshop (PAM 2002), Fort Collins, CO, USA, March 25-26, 2002 9. Author's Addresses Tanja Zseby Fraunhofer Institute for Open Communication Systems Kaiserin-Augusta-Allee 31 10589 Berlin Germany Phone: +49-30-34 63 7153 Fax: +49-30-34 53 8153 Email: zseby@fokus.fhg.de Maurizio Molina NEC Europe Ltd., Network Laboratories Adenauerplatz 6 69115 Heidelberg Germany Phone: +49 6221 90511-18 Email: molina@ccrle.nec.de Fredric Raspall NEC Europe Ltd., Network Laboratories Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 26] Internet Draft Techniques for IP Packet Selection October 2003 Adenauerplatz 6 69115 Heidelberg Germany Phone: +49 6221 90511-31 EMail: raspall@ccrle.nec.de Nick Duffield AT&T Labs - Research Room B-139 180 Park Ave Florham Park NJ 07932, USA Phone: +1 973-360-8726 Email: duffield@research.att.com 10. Intellectual Property Statement AT&T Corporation may own intellectual property applicable to this contribution. The IETF has been notified of AT&T's licensing intent for the specification contained in this document. See http://www.ietf.org/ietf/IPR/ATT-GENERAL.txt for AT&T's IPR statement. 11. Full Copyright Statement Copyright (C) The Internet Society (2002). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English. The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns. This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 27] Internet Draft Techniques for IP Packet Selection October 2003 OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Zseby, Molina, Raspall, Duffield Expires April 2004 [Page 28]