IPFIX Working Group                                          B. Trammell
Internet-Draft                                                CERT/NetSA
Expires: December 25, 2006                                     E. Boschi
                                                          Hitachi Europe
                                                                 L. Mark
                                                                T. Zseby
                                                        Fraunhofer FOKUS
                                                           June 23, 2006

                       An IPFIX-Based File Format
                    draft-trammell-ipfix-file-01.txt

Status of this Memo

By submitting this Internet-Draft, each author represents that any applicable patent or other IPR claims of which he or she is aware have been or will be disclosed, and any of which he or she becomes aware will be disclosed, in accordance with Section 6 of BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt.

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This Internet-Draft will expire on December 25, 2006.

Copyright Notice

Copyright (C) The Internet Society (2006).

Abstract

This document describes a file format for the storage of flow data based upon the IPFIX message format. It proposes a set of requirements for flat-file, binary flow data file formats, evaluates flow storage systems presently in use for their conformance to these requirements, then applies the IPFIX message format to these requirements to build a new file format. This IPFIX file format is designed especially to be useful to the implementors of IPFIX Collecting Processes.

Trammell, et al.        Expires December 25, 2006               [Page 1]
Internet-Draft                 IPFIX Files                     June 2006

1.
Introduction

The IPFIX message format makes an ideal basis for a standard flow file format for archival storage purposes and document-based workflow support. As it was designed for the efficient and flexible representation of a variety of flow and flow-like data, it is more extensible than ad-hoc file formats derived from simple data model serialization, and more efficient than record-structured textual formats such as XML.

This document explores the motivation for building a flow file format atop the IPFIX message format. It then proposes a set of requirements for this file format, and describes either how the IPFIX message format meets each requirement, how a file format based upon it could meet the requirement, or how the message format must be extended to meet the requirement. The document also examines existing flow storage file formats for their conformance to these requirements.

The purpose of this revision of the document is to foster discussion on the motivation and requirements sections in advance of proposing the design of a file format; consequently, the sections on the file format itself and examples of IPFIX files are currently placeholders without content. It is our aim to use this document and discussions concerning it in the IPFIX working group as a basis for future work on this effort.

2. Terminology

Terms used in this document that are defined in the Terminology section of the IPFIX Protocol [1] document are to be interpreted as defined there.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [3].

3. Motivation

We have identified two major use cases for file-based storage of IP flow data. The first is long-term, persistent storage of flow data for archival purposes.
Filesystems often make sense as a persistent storage backend due to their ubiquity, simplicity, and flexibility. There are a wide variety of operations available on files (e.g., external compression and encryption, atomic backup) that are made more difficult with a more integrated persistent storage system such as a relational database management system (RDBMS). As flow data is often not very semantically complicated, and is managed in very high volume, the simplicity of a file-based persistent storage backend can outweigh the advantages of these other storage systems.

The second use case is in document-based workflows. Users of many information processing systems are accustomed to dealing with documents which encapsulate all the information about a work item or collection of work items; even in situations in which document-based workflows may have significant disadvantages (e.g., revision control in a multi-editor environment), many user communities still prefer documents as the "atom" of work due to their simplicity. As an example relevant to flow data, the most common unit of work in the network forensics and research communities is the packet trace file, and utilities such as Ethereal explicitly treat these packet traces as documents. It seems likely that as flow data analysis tools are developed, many will choose to support a document-based workflow; a standard format for this document would be of great use to the analysis community. Document-based workflows are especially well supported by file-based formats.

The simplest way to create a new file format is simply to serialize some internal data model to disk, with either textual or binary representation of data elements, and some framing strategy for delimiting fields and records. "Ad-hoc" file formats such as this have several important disadvantages.
Chiefly, they impose the semantics of the data model from which they are derived on the file format; as such, they are difficult to extend, describe, and standardize.

The emergence over the past decade of XML as a new "universal" framing format for flat as well as hierarchical data addresses these concerns; however, XML is not necessarily ideal as a storage format for flow data. First, flow data, being inherently simple and record-oriented, does not benefit from the more advanced semantics available with XML. There is not much to be gained by describing each record individually when the records all have the same format, or one of a small set of formats. Second, XML processing introduces potentially significant overhead. While an XML stream should in theory be approximately as compressible as any other stream representation, the additional cost of compression/decompression and generation/parsing of XML data is not worth the benefit in this case.

This leads us to propose the IPFIX message format as the basis for a new flow data file format. The IPFIX working group, in defining the IPFIX protocol, has already defined an information model and data formatting rules for representation of flow data. Especially in the document-based workflow use case, a file may be viewed as simply another IPFIX message transport between processes. This format is especially well suited to representing flow data, as it was designed specifically for that use case; it is easily extensible, unlike ad-hoc serialization, and compact, unlike XML. In addition, IPFIX is an emerging standard for the export and collection of flow data; using a common format for storage and analysis at the collection side allows implementors to use substantially the same information model and data formatting implementation for transport as well as storage.

4.
Requirements

In this section, we outline a proposed set of requirements for any persistent storage format for flow data. First and foremost, a flow data file format should support both of the broad use cases addressed in the Motivation. In addition, the requirements enumerated in the sections below apply to both use cases. For each, we first identify the requirement, then explain how the IPFIX message format addresses it, or briefly outline the changes that must be made in order for an IPFIX-based file format to meet the requirement.

4.1. Extensibility

Due to the wide variety of flow attributes collected by different network flow attribute measurement systems, the ideal flow storage format will not impose a single data model or a specific record type on the flows it stores. The file format must be extensible; that is, it must be flexible enough to support multiple record types, and must be able to support new field types for data within the records in a graceful way.

IPFIX provides extensibility through the use of Templates to describe each Data Record, through the use of an IANA Registry to define its Information Elements, and through the use of enterprise-specific Information Elements.

4.2. Self Description

Archived data may be read at a time in the future when any external reference to the meaning of the data may be lost. The ideal flow storage format should be self-describing; that is, a process reading flow data from storage should be able to properly interpret the stored flows without reference to anything other than standard sources (e.g., the standards document describing the file format) and the stored flow data itself.

The IPFIX message format is partially self-describing; that is, IPFIX Templates containing only IANA-assigned Information Elements can be completely interpreted according to the IPFIX Information Model without additional external data.
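As an illustration of this partial self-description, the following minimal Python sketch decodes one fixed-length data record from a list of (Information Element identifier, length) pairs. It is a sketch under stated assumptions and not part of any proposed format: the function name decode_record, the table IANA_ELEMENTS (a small excerpt of the IANA registry), and the in-memory template representation are hypothetical, and variable-length fields, enterprise numbers, and IPFIX message and set headers are ignored.

```python
import ipaddress

# Hypothetical excerpt of the IANA IPFIX Information Element registry:
# element identifier -> (name, kind). A real reader would need the full registry.
IANA_ELEMENTS = {
    1: ("octetDeltaCount", "uint"),
    2: ("packetDeltaCount", "uint"),
    4: ("protocolIdentifier", "uint"),
    7: ("sourceTransportPort", "uint"),
    8: ("sourceIPv4Address", "ipv4"),
    11: ("destinationTransportPort", "uint"),
    12: ("destinationIPv4Address", "ipv4"),
}


def decode_record(template, data):
    """Decode one fixed-length record.

    template: list of (element_id, length_in_octets) pairs in field order
              (an assumed, simplified stand-in for an IPFIX Template).
    data:     the record's octets, in network byte order.
    """
    record = {}
    offset = 0
    for element_id, length in template:
        raw = data[offset:offset + length]
        if len(raw) != length:
            raise ValueError("record is shorter than its template")
        offset += length
        entry = IANA_ELEMENTS.get(element_id)
        if entry is None:
            # Not in the registry excerpt: the template gives only the
            # length, so the value can be kept only as raw octets.
            record[f"unknown_{element_id}"] = raw
            continue
        name, kind = entry
        if kind == "ipv4":
            record[name] = str(ipaddress.IPv4Address(raw))
        else:
            record[name] = int.from_bytes(raw, "big")
    return record


# Example record: a TCP flow from 192.0.2.1:49152 to 198.51.100.7:80.
example_template = [(8, 4), (12, 4), (7, 2), (11, 2), (4, 1), (1, 8)]
example_data = (
    bytes([192, 0, 2, 1])
    + bytes([198, 51, 100, 7])
    + (49152).to_bytes(2, "big")
    + (80).to_bytes(2, "big")
    + bytes([6])
    + (1500).to_bytes(8, "big")
)

if __name__ == "__main__":
    print(decode_record(example_template, example_data))
```

With only IANA-assigned elements in a template, the registry is the only outside input such a reader needs.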
However, to be fully self-describing, the IPFIX message format would require extension to add type and semantic information to the definitions of enterprise-specific Information Elements.

4.3. Data Compression

Regardless of the representation format, flow data describing traffic on real networks tends to be highly compressible. Compression tends to improve the scalability of flow collection systems, by reducing the disk storage and I/O bandwidth requirement for a given workload. The ideal flow storage format should support applications which wish to leverage this fact by supporting compression of stored data.

The IPFIX message format has no support for data compression. However, any flat file is readily compressible using a wide variety of external data compression tools, formats, and algorithms. If finer granularity than file-level compression is required, the IPFIX message format would require an extension to add some notation that a record set or message is compressed.

4.4. Indexing and Searching

Binary, record-stream-oriented file formats natively support only one form of searching, sequential scan in file order. By choosing the order of records in a file carefully (e.g., by time), a file can be "indexed" by a single key. Adding additional indexes to the file can speed searches considerably. The ideal flow storage format will support a method for noting that the records in a file are sorted by a certain key or set of keys, and for providing index information for keys on which the file is not sorted.

There is presently no support for indexing or sort order notation in the IPFIX message format. If internal indexing is required, it would need to be added to an IPFIX-based file format by extension.

4.5. Data Integrity and Error Correction

When storing flow data for archival purposes, it is important to ensure that hardware or software faults do not introduce errors into the data over time.
The ideal flow storage format will support the detection and correction of encoding-level errors in the data.

Note that this requirement is almost certainly best handled at a layer below that addressed by this document. Error correction is a topic well addressed by filesystem developers and the storage industry in general, and by specifying a flow storage format based upon files, we can leverage these features to meet this requirement.

The IPFIX message format does not support data integrity assurance or error correction; it is assumed that this requirement will be met externally.

4.6. Creator Authentication and Confidentiality

Archival storage of flow data also requires assurance that no unauthorized entity can read or modify the stored data. Asymmetric-key cryptography can be applied to this problem, by signing flow data with the private key of the creator, and encrypting it with the public keys of those authorized to read it. The ideal flow storage format will support the encryption and signing of flow data.

As with error correction, this problem has been addressed well at a layer below that addressed by this document. Instead of specifying a particular choice of encryption technology, we can leverage the fact that existing cryptographic technologies work quite well on data stored in files to meet this requirement.

Beyond support for the use of TLS for transport over TCP or SCTP, both of which provide transient authentication and confidentiality, the IPFIX message format does not support this requirement directly. It is assumed that this requirement will be met externally.

4.7. Anonymization and Obfuscation

To ensure the privacy of individuals and organizations at the endpoints of communications represented by flow records, it is often necessary to obfuscate or anonymize stored and exported flow data.
The ideal flow storage format will provide for a notation that a given information element on a given record type represents anonymized, rather than real, data.

The IPFIX message format presently has no support for anonymization notation. It should be noted that anonymization is one of the requirements given for IPFIX in RFC 3917 [2]. The decision to qualify this requirement with 'MAY' and not 'MUST' in the requirements document, and its subsequent lack of specification in the current version of the IPFIX protocol, is due to the fact that anonymization algorithms are still a research issue, and that there currently exist no standardized methods for anonymization.

It is reasonable to assume, given the stated requirements for the IPFIX protocol itself, that future extensions to the protocol will provide for the anonymization of flow records.

5. Survey of Existing Flow and Trace File Formats

5.1. Argus 2

QoSient's Argus (as of version 2.0.6) uses a file format based upon a stream of type-and-length prefixed records. There are two general types of records in this stream, management records and flow records. Management records export flow collection statistics, much like the recommended scoped data records in the IPFIX protocol. Flow records contain information about a single flow each, and are further typed based upon the protocol of the flow (e.g., IP, ICMP, ARP). The Argus file format natively supports bidirectional flow export, as each flow record contains both forward and reverse counters.

The Argus tools support a transport protocol that simply encapsulates a record stream over a TCP connection. Transport is collector-initiated; that is, a collector establishes a connection to an exporter in order to read a record stream.

Argus files are not self-describing; that is, only the Argus tools themselves encapsulate the definition of each of the record types.
The Argus file format is not extensible without changing the Argus implementation. Argus provides no indexing facility for its file format, though records are roughly sorted by record generation time. Compression, error correction, authentication, and confidentiality are handled externally to the format, and are available as with all files. There is no special support for data obfuscation in the format.

5.2. SiLK

The CERT/NetSA SiLK tools (http://silktools.sourceforge.net) use a set of fixed-length binary record formats. Each file is prefixed with a header which denotes which record format the file is stored in. These record formats are differentiated by the presence or absence of certain fields; in this way, each format identifier is essentially a short-hand identifier for a template describing the record. This also implies that only one type of record may be stored in any given file.

As with Argus, SiLK files are not self-describing and are not extensible. SiLK provides no indexing facility, though files are generally stored in flow end time order; and when used for archival storage, information about sensors and flow times appearing in each file is stored in the file path name. Compression is handled internally to the file format, and allows the storage of compressed data in a file with uncompressed headers, and a guarantee of compression block boundary alignment with record boundaries. Error correction, authentication, and confidentiality can be handled externally. There is no special support for data obfuscation in the SiLK file format.

6. IPFIX File Format Description

The IPFIX file format description is not yet available, as the purpose of this document is to elicit feedback and foster discussion on the motivation and requirements for an IPFIX file format.
A future revision of this document will include a complete description of the IPFIX file format in terms of the IPFIX message format.

7. Examples

Examples are not yet available as the file format has not yet been fully described. A future revision of this document will contain examples.

8. Security Considerations

The IPFIX-based file format itself does not directly introduce security issues. Rather, it is used to store information that may be considered sensitive for privacy or business reasons. The file format must therefore provide appropriate procedures to guarantee the integrity and confidentiality of the stored information.

The underlying protocol used to exchange the information that will be stored using the format proposed in this document must likewise apply appropriate procedures to guarantee the integrity and confidentiality of the exported information. Such issues are addressed in separate documents, specifically in the IPFIX Protocol [1].

9. IANA Considerations

This document has no actions for IANA.

10. Open Issues and Notes

This draft is presently incomplete. The intent of this revision is to provide a starting point to discuss requirements for an IPFIX-based file format, and the applicability of the work proposed herein to the mission of the IPFIX Working Group. [bht]

The survey of existing file formats is incomplete, and includes only file formats with which one of the authors has personal experience. [bht]

There should be a mention of the zero-length message and set hacks (from -00) somewhere in this draft, to support the use case of an IPFIX header on legacy fixed-length binary files. [bht]

11. References

11.1. Normative References

[1] Claise, B., "IPFIX Protocol Specification", draft-ietf-ipfix-protocol-22 (work in progress), June 2006.

11.2. Informative References

[2] Quittek, J., Zseby, T., Claise, B., and S.
Zander, "Requirements for IP Flow Information Export (IPFIX)", RFC 3917, October 2004.

[3] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

Authors' Addresses

Brian H. Trammell
CERT Network Situational Awareness
Software Engineering Institute
4500 Fifth Avenue
Pittsburgh, PA 15213
United States
Phone: +1 412 268 9748
Email: bht@cert.org

Elisa Boschi
Hitachi Europe SAS
Immueble Le Theleme
1503 Route les Dolines
Valbonne 06560
France
Phone: +33 4 89874180
Email: elisa.boschi@hitachi-eu.com

Lutz Mark
Fraunhofer Institute for Open Communication Systems
Kaiserin-Augusta-Allee 31
Berlin 10589
Germany
Phone: +49 30 3463 7306
Email: mark@fokus.fraunhofer.de

Tanja Zseby
Fraunhofer Institute for Open Communication Systems
Kaiserin-Augusta-Allee 31
Berlin 10589
Germany
Phone: +49 30 3463 7153
Email: zseby@fokus.fraunhofer.de

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Disclaimer of Validity

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Copyright Statement

Copyright (C) The Internet Society (2006). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

Acknowledgment

Funding for the RFC Editor function is currently provided by the Internet Society.