IPFIX Working Group                                          B. Trammell
Internet-Draft                                                CERT/NetSA
Expires: December 25, 2006                                     E. Boschi
                                                          Hitachi Europe
                                                                 L. Mark
                                                                T. Zseby
                                                        Fraunhofer FOKUS
                                                           June 23, 2006


                       An IPFIX-Based File Format
                    draft-trammell-ipfix-file-01.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on December 25, 2006.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This document describes a file format for the storage of flow data
   based upon the IPFIX message format.  It proposes a set of
   requirements for flat-file, binary flow data file formats, evaluates
   flow storage systems presently in use for their conformance to these


Trammell, et al.        Expires December 25, 2006               [Page 1]

Internet-Draft                 IPFIX Files                     June 2006


   requirements, then applies the IPFIX message format to these
   requirements to build a new file format.  This IPFIX file format is
   designed especially to be useful to the implementors of IPFIX
   Collecting Processes.


1.  Introduction

   The IPFIX message format makes an ideal basis for a standard flow
   file format for archival storage purposes and document-based workflow
   support.  As it was designed for the efficient and flexible
   representation of a variety of flow and flow-like data, it is more
   extensible than ad-hoc file formats derived from simple data model
   serialization, and more efficient than record-structured textual
   formats such as XML.

   This document explores the motivation for building a flow file format
   atop the IPFIX message format.  It then proposes a set of
   requirements for this file format, and describes either how the IPFIX
   message format meets each requirement, how a file format based upon
   it could meet the requirement, or how the message format must be
   extended to meet the requirement.  The document also examines
   existing flow storage file formats for their conformance to these
   requirements.

   The purpose of this revision of the document is to foster discussion
   on the motivation and requirements sections in advance of proposing
   the design of a file format; consequently, the sections on the file
   format itself and examples of IPFIX files are currently placeholders
   without content.  It is our aim to use this document and discussions
   concerning it in the IPFIX working group as a basis for future work
   on this effort.


2.  Terminology

   Terms used in this document that are defined in the Terminology
   section of the IPFIX Protocol [1] document are to be interpreted as
   defined there.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [3].


3.  Motivation

   We have identified two major use cases for file-based storage of IP
   flow data.  The first is long-term, persistent storage of flow data
   for archival purposes.  Filesystems often make sense as a persistent
   storage backend due to their ubiquity, simplicity, and flexibility.
   There are a wide variety of operations available on files (e.g.,
   external compression and encryption, atomic backup) that are made
   more difficult with a more integrated persistent storage system such
   as a relational database management system (RDBMS).  As flow data is
   often not very semantically complicated, and is managed in very high
   volume, the simplicity of a file-based persistent storage backend can
   outweigh the advantages of these other storage systems.

   The second use case is in document-based workflows.  Users of many
   information processing systems are accustomed to dealing with
   documents which encapsulate all the information about a work item or
   collection of work items; even in situations in which document-based
   workflows may have significant disadvantages (e.g., revision control
   in a multi-editor environment), many user communities still prefer
   documents as the "atom" of work due to their simplicity.  As an
   example relevant to flow data, the most common unit of work in the
   network forensics and research communities is the packet trace file,
   and utilities such as Ethereal explicitly treat these packet traces
   as documents.  It seems likely that as flow data analysis tools are
   developed, many will choose to support a document-based workflow; a
   standard format for this document would be of great use to the
   analysis community.  Document-based workflows are especially well
   supported by file-based formats.

   The simplest way to create a new file format is simply to serialize
   some internal data model to disk, with either textual or binary
   representation of data elements, and some framing strategy for
   delimiting fields and records.  "Ad-hoc" file formats such as this
   have an important disadvantage: they impose the semantics of the
   data model from which they are derived on the file format; as such,
   they are difficult to extend, describe, and standardize.

   The emergence over the past decade of XML as a new "universal"
   framing format for flat as well as hierarchical data addresses these
   concerns; however, XML is not necessarily ideal for a storage format
   for flow data.  First, flow data, being inherently simple and record-
   oriented, does not benefit from the more advanced semantics available
   with XML.  There is not much to be gained by describing each record
   individually when the records all have the same format, or one of a
   small set of formats.  Second, XML processing introduces potentially
   significant overhead.  While an XML stream should in theory be
   approximately as compressible as any other stream representation, the
   added cost of compressing, decompressing, generating, and parsing XML
   data outweighs its benefits in this case.


   This leads us to propose the IPFIX message format as the basis for a
   new flow data file format.  The IPFIX working group, in defining the
   IPFIX protocol, has already defined an information model and data
   formatting rules for representation of flow data.  Especially in the
   document-based workflow use case, a file may be viewed as simply
   another IPFIX message transport between processes.  This format is
   especially well suited to representing flow data, as it was designed
   specifically for that use case; it is easily extensible, unlike ad-hoc
   serialization, and compact, unlike XML.  In addition, IPFIX is an
   emerging standard for the export and collection of flow data; using a
   common format for storage and analysis at the collection side allows
   implementors to use substantially the same information model and data
   formatting implementation for transport as well as storage.
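
   As an illustration of treating a file as another message transport,
   the following Python sketch walks a byte stream assumed to consist of
   nothing but back-to-back IPFIX Messages, using the Length field of
   each 16-octet Message Header to find the next one.  This framing is
   an assumption made for the example only; this document does not yet
   specify a file format, and the function names are hypothetical.

```python
import io
import struct

# IPFIX Message Header, network byte order: Version Number, Length,
# Export Time, Sequence Number, Observation Domain ID (16 octets).
HEADER = struct.Struct("!HHIII")
IPFIX_VERSION = 10


def iter_messages(stream):
    """Yield ((export_time, sequence, domain), body) per message."""
    while True:
        raw = stream.read(HEADER.size)
        if not raw:
            return  # clean end of stream
        if len(raw) < HEADER.size:
            raise ValueError("truncated message header")
        version, length, export_time, sequence, domain = HEADER.unpack(raw)
        if version != IPFIX_VERSION:
            raise ValueError("unexpected version %d" % version)
        if length < HEADER.size:
            raise ValueError("message length shorter than header")
        # Length counts the header too, so the body is the remainder.
        body = stream.read(length - HEADER.size)
        if len(body) != length - HEADER.size:
            raise ValueError("truncated message body")
        yield (export_time, sequence, domain), body


def make_message(export_time, sequence, domain, body=b""):
    """Build a message with an opaque body (helper for the example)."""
    total = HEADER.size + len(body)
    return HEADER.pack(IPFIX_VERSION, total, export_time, sequence, domain) + body


# Usage: two messages written back to back, then read as a stream.
data = make_message(1151020800, 0, 1, b"abcd") + make_message(1151020801, 1, 1)
for fields, body in iter_messages(io.BytesIO(data)):
    print(fields, len(body))
```

   The same loop would serve a Collecting Process reading from a stream
   transport, which is the reuse of implementation argued for above.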


4.  Requirements

   In this section, we outline a proposed set of requirements for any
   persistent storage format for flow data.  First and foremost, a flow
   data file format should support both of the broad use cases addressed
   in the Motivation.  In addition, the requirements enumerated in the
   sections below apply to both use cases.  For each, we first identify
   the requirement, then explain how the IPFIX message format addresses
   it, or briefly outline the changes that must be made in order for an
   IPFIX-based file format to meet the requirement.

4.1.  Extensibility

   Due to the wide variety of flow attributes collected by different
   network flow attribute measurement systems, the ideal flow storage
   format will not impose a single data model or a specific record type
   on the flows it stores.  The file format must be extensible; that is,
   it must be flexible enough to support multiple record types, and must
   be able to support new field types for data within the records in a
   graceful way.

   IPFIX provides extensibility through the use of Templates to describe
   each Data Record, through the use of an IANA Registry to define its
   Information Elements, and through the use of enterprise-specific
   Information Elements.
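
   How Templates decouple a reader from any fixed record layout can be
   sketched as follows.  The Python below parses a Template Record
   (Template ID, Field Count, then Field Specifiers, with the high bit
   of the Information Element identifier marking an enterprise-specific
   element) and uses the result to decode a Data Record.  It is a
   simplification under stated assumptions: every value is treated as an
   unsigned integer, variable-length elements are not handled, and the
   enterprise number and enterprise element in the usage lines are made
   up for the example.

```python
import struct


def parse_template(buf):
    """Parse one Template Record.

    Returns (template_id, [(enterprise, ie_id, length), ...]), where
    enterprise is None for IANA-assigned Information Elements.
    """
    template_id, field_count = struct.unpack_from("!HH", buf, 0)
    offset = 4
    fields = []
    for _ in range(field_count):
        ie_id, length = struct.unpack_from("!HH", buf, offset)
        offset += 4
        enterprise = None
        if ie_id & 0x8000:  # enterprise bit: a 32-bit number follows
            (enterprise,) = struct.unpack_from("!I", buf, offset)
            offset += 4
            ie_id &= 0x7FFF
        fields.append((enterprise, ie_id, length))
    return template_id, fields


def decode_record(fields, buf, offset=0):
    """Decode one fixed-length Data Record described by fields.

    Returns ({(enterprise, ie_id): value}, offset_after_record).
    """
    record = {}
    for enterprise, ie_id, length in fields:
        value = int.from_bytes(buf[offset:offset + length], "big")
        record[(enterprise, ie_id)] = value
        offset += length
    return record, offset


# Usage: sourceIPv4Address (8), octetDeltaCount (1), and a hypothetical
# enterprise-specific element 5 under made-up enterprise number 99999.
template = (struct.pack("!HH", 256, 3)
            + struct.pack("!HH", 8, 4)
            + struct.pack("!HH", 1, 8)
            + struct.pack("!HHI", 0x8000 | 5, 2, 99999))
template_id, fields = parse_template(template)
data = bytes([192, 0, 2, 1]) + (1500).to_bytes(8, "big") + (7).to_bytes(2, "big")
record, end = decode_record(fields, data)
print(template_id, record, end)
```

   A new record type requires only a new Template in the stream, not a
   change to the decoding code.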

4.2.  Self Description

   Archived data may be read at a time in the future when any external
   reference to the meaning of the data may be lost.  The ideal flow
   storage format should be self-describing; that is, a process reading
   flow data from storage should be able to properly interpret the
   stored flows without reference to anything other than standard


Trammell, et al.        Expires December 25, 2006               [Page 4]

Internet-Draft                 IPFIX Files                     June 2006


   sources (e.g., the standards document describing the file format) and
   the stored flow data itself.

   The IPFIX message format is partially self-describing; that is, IPFIX
   Templates containing only IANA-assigned Information Elements can be
   completely interpreted according to the IPFIX Information Model
   without additional external data.  However, to be fully self-
   describing, the IPFIX message format would require extension to add
   type and semantic information to the definitions of enterprise-
   specific Information Elements.

4.3.  Data Compression

   Regardless of the representation format, flow data describing traffic
   on real networks tends to be highly compressible.  Compression tends
   to improve the scalability of flow collection systems, by reducing
   the disk storage and I/O bandwidth requirement for a given workload.
   The ideal flow storage format should support applications which wish
   to leverage this fact by supporting compression of stored data.

   The IPFIX message format has no support for data compression.
   However, any flat file is readily compressible using a wide variety
   of external data compression tools, formats, and algorithms.  If
   finer granularity than file-level compression is required, the IPFIX
   message format would require an extension to add some notation that a
   record set or message is compressed.
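
   File-level compression applied outside the format can be sketched
   with Python's standard gzip module.  The record layout and file name
   below are invented for the example, and the synthetic records are
   deliberately regular, so the size reduction it shows says nothing
   about any particular real workload.

```python
import gzip
import os
import struct
import tempfile


def write_gzip(path, chunks):
    """Write an iterable of byte strings to path, gzip-compressed."""
    with gzip.open(path, "wb") as out:
        for chunk in chunks:
            out.write(chunk)


def read_gzip(path):
    """Return the uncompressed contents of a gzip-compressed file."""
    with gzip.open(path, "rb") as src:
        return src.read()


def demo():
    """Return (raw size, compressed size, round trip succeeded)."""
    # Hypothetical 14-octet records: address, port, octet count.
    records = [struct.pack("!IHQ", 0xC0000200 + (i % 16), 80, 1500)
               for i in range(4096)]
    raw = b"".join(records)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "flows.bin.gz")
        write_gzip(path, records)
        return len(raw), os.path.getsize(path), read_gzip(path) == raw


print(demo())
```

   A reader handed the decompressed stream needs no knowledge of the
   compression.  The cost is that a gzip stream can only be decompressed
   sequentially from its start, which is one reason finer-grained
   compression would need support within the format itself.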

4.4.  Indexing and Searching

   Binary, record stream oriented file formats natively support only one
   form of searching: sequential scan in file order.  By choosing the
   order of records in a file carefully (e.g., by time), a file can be
   "indexed" by a single key.  Adding additional indexes to the file can
   speed searches considerably.  The ideal flow storage format will
   support a method for noting that the records in a file are sorted by
   a certain key or set of keys, and for providing index information for
   keys on which the file is not sorted.

   There is presently no support for indexing or sort order notation in
   the IPFIX message format.  If internal indexing is required, it would
   need to be added to an IPFIX-based file format by extension.
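
   The benefit of a known sort order can be sketched with a binary
   search.  The Python below assumes a file of fixed-length records
   sorted by start time; the 16-octet record layout is hypothetical.
   IPFIX Data Records in a file need not share one length, so an actual
   IPFIX-based format would need additional index information to support
   the same kind of lookup.

```python
import io
import struct

# Hypothetical fixed-length record: start time, source address, octets.
RECORD = struct.Struct("!IIQ")


def first_at_or_after(stream, start_time):
    """Return the index of the first record with time >= start_time.

    The stream must be seekable and its records sorted by start time.
    If no record qualifies, the record count is returned.
    """
    stream.seek(0, io.SEEK_END)
    count = stream.tell() // RECORD.size
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        stream.seek(mid * RECORD.size)
        record_time, _, _ = RECORD.unpack(stream.read(RECORD.size))
        if record_time < start_time:
            lo = mid + 1
        else:
            hi = mid
    return lo


# Usage: five records with start times 10, 20, 20, 30, 40.
times = [10, 20, 20, 30, 40]
buf = io.BytesIO(b"".join(RECORD.pack(t, 0, 0) for t in times))
print(first_at_or_after(buf, 25))
```

   The search reads a number of records logarithmic in the file size,
   where a sequential scan would read every record before the match.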

4.5.  Data Integrity and Error Correction

   When storing flow data for archival purposes, it is important to
   ensure that hardware or software faults do not introduce errors into
   the data over time.  The ideal flow storage format will support the
   detection and correction of encoding-level errors in the data.

   Note that this requirement is almost certainly best handled at a
   layer below that addressed by this document.  Error correction is a
   topic well addressed by filesystem developers and the storage
   industry in general, and by specifying a flow storage format based
   upon files, we can leverage these features to meet this requirement.

   The IPFIX message format does not support data integrity assurance or
   error correction; it is assumed that this requirement will be met
   externally.
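
   One way integrity might be checked externally is to record a
   cryptographic digest of each file when it is archived and to compare
   it on later reads.  The Python sketch below uses SHA-256 from the
   standard hashlib module; the choice of digest, the file name, and the
   idea of storing the digest alongside the archive are assumptions of
   the example, not part of any proposal in this document.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hexadecimal SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as src:
        for chunk in iter(lambda: src.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def demo():
    """Return (intact file verifies, corrupted file is detected)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "flows.ipfix")
        with open(path, "wb") as out:
            out.write(b"\x00\x0a" + bytes(62))
        recorded = sha256_of(path)  # kept alongside the archived file
        intact = sha256_of(path) == recorded
        with open(path, "r+b") as damaged:
            damaged.seek(10)
            damaged.write(b"\xff")  # simulate one corrupted octet
        detected = sha256_of(path) != recorded
        return intact, detected


print(demo())
```

   A digest of this kind detects corruption but cannot repair it;
   correction would still rely on the storage layer or on redundant
   copies.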

4.6.  Creator Authentication and Confidentiality

   Archival storage of flow data also requires assurance that no
   unauthorized entity can read or modify the stored data.  Asymmetric-
   key cryptography can be applied to this problem, by signing flow data
   with the private key of the creator, and encrypting it with the
   public keys of those authorized to read it.  The ideal flow storage
   format will support the encryption and signing of flow data.

   As with error correction, this problem has been addressed well at a
   layer below that addressed by this document.  Instead of specifying a
   particular choice of encryption technology, we can leverage the fact
   that existing cryptographic technologies work quite well on data
   stored in files to meet this requirement.

   Beyond support for the use of TLS for transport over TCP or SCTP,
   both of which provide transient authentication and confidentiality,
   the IPFIX message format does not support this requirement directly.
   It is assumed that this requirement will be met externally.

4.7.  Anonymization and Obfuscation

   To ensure the privacy of individuals and organizations at the
   endpoints of communications represented by flow records, it is often
   necessary to obfuscate or anonymize stored and exported flow data.
   The ideal flow storage format will provide for a notation that a
   given information element on a given record type represents
   anonymized, rather than real, data.

   The IPFIX message format presently has no support for
   anonymization notation.  It should be noted that anonymization is one
   of the requirements given for IPFIX in RFC 3917 [2].  The decision to
   qualify this requirement with 'MAY' and not 'MUST' in the
   requirements document, and its subsequent lack of specification in
   the current version of the IPFIX protocol, is due to the fact that
   anonymization algorithms are still a research issue, and that there
   currently exist no standardized methods for anonymization.

   It is reasonable to assume, given the stated requirements for the
   IPFIX protocol itself, that future extensions to the protocol will
   provide for the anonymization of flow records.


5.  Survey of Existing Flow and Trace File Formats

5.1.  Argus 2

   QoSient's Argus (as of version 2.0.6) uses a file format based upon a
   stream of type-and-length prefixed records.  There are two general
   types of records in this stream, management records and flow records.
   Management records export flow collection statistics, much like the
   recommended scoped data records in the IPFIX protocol.  Flow records
   contain information about a single flow each, and are further typed
   based upon the protocol of the flow (e.g., IP, ICMP, ARP).  The Argus
   file format natively supports bidirectional flow export, as each flow
   record contains both forward and reverse counters.

   The Argus tools support a transport protocol that simply encapsulates
   a record stream over a TCP connection.  Transport is collector-
   initiated; that is, a collector establishes a connection to an
   exporter in order to read a record stream.

   Argus files are not self-describing; that is, only the Argus tools
   themselves encapsulate the definition of each of the record types.
   The Argus file format is not extensible without changing the Argus
   implementation.  Argus provides no indexing facility for its file
   format, though records are roughly sorted by record generation time.
   Compression, error correction, authentication, and confidentiality
   are handled externally to the format, and are available as with all
   files.  There is no special support for data obfuscation in the
   format.

5.2.  SiLK

   The CERT/NetSA SiLK tools (http://silktools.sourceforge.net) use a
   set of fixed-length binary record formats.  Each file is prefixed
   with a header which denotes which record format the file is stored
   in.  These record formats are differentiated by the presence or
   absence of certain fields; in this way, each format identifier is
   essentially a short-hand identifier for a template describing the
   record.  This also implies that only one type of record may be stored
   in any given file.
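
   The general technique of selecting a fixed record layout from a
   format identifier in the file header can be sketched as follows.  The
   magic number, identifiers, and layouts in this Python example are
   invented for illustration and are not the actual SiLK formats.

```python
import io
import struct

# Hypothetical file header: 4-octet magic number, 1-octet format id.
FILE_HEADER = struct.Struct("!4sB")

# Hypothetical record layouts, keyed by format identifier.
LAYOUTS = {
    1: struct.Struct("!IIHHI"),   # src, dst, sport, dport, end time
    2: struct.Struct("!IIHHIQ"),  # as format 1, plus an octet count
}


def read_records(stream):
    """Yield one tuple per record, using the layout named in the header."""
    magic, format_id = FILE_HEADER.unpack(stream.read(FILE_HEADER.size))
    if magic != b"FLOW":
        raise ValueError("not a flow file")
    layout = LAYOUTS[format_id]  # KeyError: format unknown to this reader
    while True:
        raw = stream.read(layout.size)
        if not raw:
            return
        if len(raw) < layout.size:
            raise ValueError("truncated record")
        yield layout.unpack(raw)


# Usage: a file in format 2 holding two records.
contents = (FILE_HEADER.pack(b"FLOW", 2)
            + LAYOUTS[2].pack(1, 2, 3, 4, 5, 6)
            + LAYOUTS[2].pack(7, 8, 9, 10, 11, 12))
for record in read_records(io.BytesIO(contents)):
    print(record)
```

   Because the layouts live in the reader rather than in the file, a
   file written in a format the reader does not know cannot be
   interpreted at all.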

   As with Argus, SiLK files are not self-describing and are not
   extensible.  SiLK provides no indexing facility, though files are
   generally stored in flow end time order; and when used for archival
   storage, information about sensors and flow times appearing in each
   file is stored in the file path name.  Compression is handled
   internally to the file format, and allows the storage of compressed
   data in a file with uncompressed headers, and a guarantee of
   compression block boundary alignment with record boundaries.  Error
   correction, authentication, and confidentiality can be handled
   externally.  There is no special support for data obfuscation in the
   SiLK file format.


6.  IPFIX File Format Description

   The IPFIX file format description is not yet available, as the
   purpose of this document is to elicit feedback and foster discussion
   on the motivation and requirements for an IPFIX file format.  A
   future revision of this document will include a complete description
   of the IPFIX file format in terms of the IPFIX message format.


7.  Examples

   Examples are not yet available as the file format has not yet been
   fully described.  A future revision of this document will contain
   examples.


8.  Security Considerations

   The IPFIX-based file format itself does not directly introduce
   security issues.  Rather, it is used to store information which may,
   for privacy or business reasons, be considered sensitive.  The file
   format must therefore provide appropriate procedures to guarantee the
   integrity and confidentiality of the stored information.

   The underlying protocol used to exchange the information that will be
   stored using the format proposed in this document must also apply
   appropriate procedures to guarantee the integrity and confidentiality
   of the exported information.  Such issues are addressed in separate
   documents, specifically in the IPFIX Protocol [1].


9.  IANA Considerations

   This document has no actions for IANA.


10.  Open Issues and Notes

   This draft is presently incomplete.  The intent of this revision is
   to provide a starting point to discuss requirements for an IPFIX-
   based file format, and the applicability of the work proposed herein
   to the mission of the IPFIX Working Group. [bht]

   The survey of existing file formats is incomplete, and includes only
   file formats with which one of the authors has personal experience.
   [bht]

   There should be a mention of the zero-length message and set hacks
   (from -00) somewhere in this draft, to support the use case of an
   IPFIX header on a legacy fixed-length binary file. [bht]


11.  References

11.1.  Normative References

   [1]  Claise, B., "IPFIX Protocol Specification",
        draft-ietf-ipfix-protocol-22 (work in progress), June 2006.

11.2.  Informative References

   [2]  Quittek, J., Zseby, T., Claise, B., and S. Zander, "Requirements
        for IP Flow Information Export (IPFIX)", RFC 3917, October 2004.

   [3]  Bradner, S., "Key words for use in RFCs to Indicate Requirement
        Levels", BCP 14, RFC 2119, March 1997.


Authors' Addresses

   Brian H. Trammell
   CERT Network Situational Awareness
   Software Engineering Institute
   4500 Fifth Avenue
   Pittsburgh, PA  15213
   United States

   Phone: +1 412 268 9748
   Email: bht@cert.org


   Elisa Boschi
   Hitachi Europe SAS
   Immeuble Le Theleme
   1503 Route les Dolines
   Valbonne  06560
   France

   Phone: +33 4 89874180
   Email: elisa.boschi@hitachi-eu.com


   Lutz Mark
   Fraunhofer Institute for Open Communication Systems
   Kaiserin-Augusta-Allee 31
   Berlin  10589
   Germany

   Phone: +49 30 3463 7306
   Email: mark@fokus.fraunhofer.de


   Tanja Zseby
   Fraunhofer Institute for Open Communication Systems
   Kaiserin-Augusta-Allee 31
   Berlin  10589
   Germany

   Phone: +49 30 3463 7153
   Email: zseby@fokus.fraunhofer.de


Intellectual Property Statement

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Disclaimer of Validity

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.


Copyright Statement

   Copyright (C) The Internet Society (2006).  This document is subject
   to the rights, licenses and restrictions contained in BCP 78, and
   except as set forth therein, the authors retain all their rights.


Acknowledgment

   Funding for the RFC Editor function is currently provided by the
   Internet Society.