INTERNET-DRAFT                                               F. Lundberg
Expires: 17 November 2003                                         Linova
Category: Standards Track                                       May 2003


              BaseStream - A Simple Typed Stream Protocol
                   draft-flundberg-basestream-00.txt

This document is an Internet-Draft and is subject to all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.html

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html

Status of this Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

BaseStream is a simple, light-weight binary protocol consisting of a sequence of typed and possibly named data elements. It can be used whenever sequences of bytes are handled, for example, as a base for a TCP protocol or a file format. An instance of the BaseStream protocol has a corresponding XML representation that makes it possible to use existing XML software to view, edit, validate, and transform BaseStream data.

Lundberg                    Standards Track                     [Page 1]
RFC nnnn      BaseStream - A Simple Typed Stream Protocol      May 2003

Table of Contents

1.   Introduction
2.   BaseStream Definition
2.1  Notations
2.2  Protocol Syntax
2.3  Future Versions
2.4  Protocol Element
3.   XML Representation
4.   Discussion
4.1  The use of the e-Element
4.2  Byte Order
4.3  Character Encoding
4.4  Element Names
4.5  BXML B-element Representation
5.   Security Considerations
     Normative References
     Informative References
     Author's Address
A.   XML Schema for BXML types
     Intellectual Property and Copyright Statements

1. Introduction

Binary data is difficult for humans to handle. Therefore text-based protocols such as SMTP and HTTP are popular. The software for these protocols is easier to debug since it uses a relatively human-friendly communication protocol. However, text-based protocols have the following drawbacks compared to binary protocols.

o Text-based protocols need more bytes to send non-text-based data.

o Parsing of a text-based protocol is not efficient compared to parsing of binary data since the same data value can be represented in many ways. For example, "12.3" is normally the same as "12.300" or "1.23e1" in a text-based protocol.
o Another drawback is that it is seldom possible to read only part of a text-based file since the exact byte offset to a data item is generally not known.

BaseStream is a light-weight, simple, binary protocol consisting of a sequence of typed and possibly named data elements. The protocol is a data serialization format suitable as the base for file formats, TCP protocols, or any protocol that handles sequences of bytes. Together with appropriate tools, a BaseStream can easily be viewed and edited. BaseStream data is therefore relatively human-friendly, but without the drawbacks of text-based protocols.

This document specifies an XML representation of a BaseStream. XML stands for Extensible Markup Language and is defined in [1]. The BaseStream XML application is called BXML. By using software to convert data between binary BaseStream and BXML, any text or XML editor can be used to edit BaseStream data. Furthermore, this transformation opens up BaseStream data to other XML technologies such as data validation with XML Schemas and data transformation with XSLT (Extensible Stylesheet Language Transformations).

This document specifies two things.

1. What a BaseStream is and the rules for determining whether a stream is a BaseStream or not.

2. How a BaseStream is represented in XML format.

The goal of the BaseStream protocol is to make it easy to share binary data and develop applications that use a binary protocol.

2. BaseStream Definition

2.1 Notations

This subsection defines the specific meaning of some words in the context of this document.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [8].

A byte is an 8-bit data entity. A STREAM is a finite sequence of bytes. A STREAM is created by a WRITER.
A WRITER is an entity capable of producing a sequence of bytes, and a BaseStreamWriter is a WRITER that produces a BaseStream. A READER reads a STREAM created by a WRITER. A BaseStreamReader is a READER that can read a BaseStream and interpret the bytes as elements as described later in this document.

A protocol is a set of rules for a STREAM that determines whether or not the STREAM adheres to the protocol.

"BaseStream" is used to refer either to the protocol defined in this document or to a STREAM that adheres to this protocol. The terms "BaseStream protocol" and "BaseStream stream" can be used to distinguish the two cases. "BaseStream1" is this version, that is, version 1 of the protocol.

A "protocol instance" is a concrete instance of a STREAM that follows a specific protocol. A "BaseStream instance" is a STREAM that follows the rules of the BaseStream protocol.

A "BaseStream application" is a specific protocol based on the BaseStream protocol. A BaseStream application may put any further restrictions on the STREAM as long as every instance of the application is a BaseStream instance.

Big-endian byte order specifies that multi-byte integers and floating point numbers are written to a stream or stored in memory with the most significant byte first. If little-endian byte order is used, the most significant byte is the last byte.

The following abbreviations are used for basic data types.

o INT1, an 8-bit signed integer.

o INT2, a 16-bit signed integer.

o INT4, a 32-bit signed integer.

o INT8, a 64-bit signed integer.

o FLOAT4, a 4-byte floating point number.

o FLOAT8, an 8-byte floating point number.

The integer types are 8-bit, 16-bit, 32-bit and 64-bit signed two's-complement integers. The FLOAT4 and FLOAT8 are 4- and 8-byte data entities used to store floating point numbers.
These follow the single-precision 32-bit and double-precision 64-bit formats in the IEEE 754 standard [2]. For all the numerical types big-endian byte order is used.

The integers in this document are decimal by default; when hexadecimal notation is used, the integers are prefixed with "0x".

A byte value may be specified with the corresponding Unicode [5] character in the interval from 0 to 127. For example, 'a' is the value 97 or 0x61. These Unicode character values coincide with the ASCII [9] character values.

2.2 Protocol Syntax

A BaseStream consists of a sequence of data elements. The elements can be of a simple type, an array type or the string type. The element type is specified by the type byte (an ASCII character), which is one of the following: b, s, i, l, f, d, B, S, I, L, F, D, or U. The type byte is also called the type character since the byte corresponds to a character. The type byte is the first byte in an unnamed element. For a named element it is the first byte after the name of the element.

There are six simple element types. The simple integer types are the b, s, i and l-element types. The letters are abbreviations for "byte", "short", "int", and "long", which are type names in widely used programming languages. These elements store INT1, INT2, INT4, and INT8 values respectively. The two types for floating point data are the f-element and the d-element, which store FLOAT4 and FLOAT8 data. f and d are abbreviations of "float" and "double".

The array types are the B, S, I, L, F, and D-element types. These may also be referred to as the B array element type, S array element type, and so on. These types are used to store an array of INT1, INT2, INT4, INT8, FLOAT4 or FLOAT8 values. Arrays with zero elements are allowed.

Every element may optionally be named. The element name uses a subset of the US-ASCII charset.
The reason for the strong limitations on the element names is that the name should be possible to print on all computer systems. The element name is stored with ASCII [9] encoding.

Augmented Backus-Naur Form (ABNF) as defined in RFC 2234 [3] is used to specify the grammar of a BaseStream. See the figure below.

   BaseStream1 = Element0 *element e

   Element0 = i %d0 %d3 %d56 %d1
              ; byte 'i' followed by an INT4 with value 256001

   element = [elementName] (simple / array / string)

   simple = b INT1 /            ; b-element
            s INT2 /            ; s-element
            i INT4 /            ; i-element
            l INT8 /            ; l-element
            f FLOAT4 /          ; f-element
            d FLOAT8            ; d-element

   array = B size *INT1 /       ; B-element
           S size *INT2 /       ; S-element
           I size *INT4 /       ; I-element
           L size *INT8 /       ; L-element
           F size *FLOAT4 /     ; F-element
           D size *FLOAT8       ; D-element

   string = U size utf8
   utf8 =

   size = shortSize / longSize
   shortSize =
   longSize = minus8 longSizeNumber
   minus8 =
   longSizeNumber =

   elementName = N nameSize nameString
   nameSize =
   nameString = ALPHA 0*126digitLetterOrUnderscore

   i256001 = %d105 %d0 %d3 %d56
             ; an INT4 with value 256001

   digitLetterOrUnderscore = ALPHA / DIGIT / UNDERSCORE
   ALPHA = %x41-5A / %x61-7A    ; A-Z / a-z
   UNDERSCORE = %x5F            ; '_'
   DIGIT = "0" / "1" / "2" / "3" / "4" / "5" / "6" / "7" / "8" / "9"

   e = %x65
   N = %x4E
   b = %x62
   s = %x73
   i = %x69
   l = %x6C
   f = %x66
   d = %x64
   B = %x42
   S = %x53
   I = %x49
   L = %x4C
   F = %x46
   D = %x44
   U = %x55

   INT1 =
   INT2 =
   INT4 =
   INT8 =
   FLOAT4 =
   FLOAT8 =

We can now list the rules that completely define the BaseStream protocol. A STREAM is a BaseStream of version 1 (a BaseStream1) if and only if all the following rules are fulfilled.

1. The stream MUST follow the BaseStream1 grammar rule.

2. The nameSize integer MUST equal the number of bytes (characters) in the following nameString.

3. The size integer (grammar rule "size") MUST equal the number of integers or floating point numbers that follow for the B, S, I, L, F, and D elements. For the U-element the size MUST equal the number of bytes needed to encode the UTF-8 string.

4. If a U-element is named "bs_tag", the string value of the element MUST follow the "nameString" grammar rule. Such an element is called a tag-element.

5. If a U-element is named "bs_end", the string value MUST be the empty string (a string with zero characters). Such an element is called an end-element.

6. At any point in the stream the number of end-elements written to the stream MUST NOT exceed the number of tag-elements.

7. The total number of end-elements in a BaseStream MUST equal the number of tag-elements.

8. If Element1 is a U-element named "protocol", the string value of the element MUST be the name of the BaseStream application.

The tag-element and the end-element described above affect how the BaseStream is represented as XML. This is treated later in this document.

A BaseStream always starts with a byte of value 105, corresponding to 'i' in the ASCII character set [9]. Then there are four bytes which form an INT4 with the value 256001. Thus a BaseStream1 always starts with the following five bytes: 105, 0, 3, 56, 1. These bytes are the first element and are called Element0. The purpose of Element0 is to identify a STREAM as a BaseStream.

After Element0, there are zero or more user data elements called Element1, Element2, and so on. After the elements in the stream there is one byte indicating the end of the stream. This is 'e' (101).

The array types and the string type have a size. For the array types, the size is the number of integers or floating point numbers in the array. For the string type, the size is the number of bytes needed to encode the string in UTF-8. The size is stored in one of two ways. See the grammar rules "shortSize" and "longSize". If the size is between 0 and 127 it is stored as an INT1.
If the size is larger than or equal to 128 it is stored as an INT8.

Strings are stored using the Unicode [5] character set, which handles all widely used characters. After the string element type byte ('U') and the size there are bytes that represent a string in the UTF-8 format. See RFC 2279 [4] for information about the UTF-8 charset encoding.

2.3 Future Versions

Future versions of the BaseStream protocol are not currently anticipated, but are still prepared for. BaseStreamX is used to denote version X of BaseStream, where X is 2, 3, and so on. Version X of the protocol MUST start with the byte 'i' and then an INT4 with the value 256000 + X. Any version of BaseStream will thus always start with the four bytes: 105, 0, 3, 56. The fifth byte is the BaseStream version. A version number higher than 127 will never be used.

2.4 Protocol Element

BaseStream provides a standard way of storing the name of the BaseStream protocol in the beginning of the stream. If Element1 of a BaseStream is a U-element named "protocol", the value of this element is the name of the protocol.

If a BaseStream application contains the name of the protocol in the data, it SHOULD be stored as Element1 as described above. It is RECOMMENDED that the charset for the protocol name is US-ASCII. To provide unique protocol names and some information about the origin of the protocol, it is suggested that a URL is used as the protocol name. For example, the company X could call their 2D plot format "http://www.x.com/plot2d1", where "1" stands for version 1 of the protocol.

3. XML Representation

This section specifies an XML representation of a BaseStream called BaseStream XML or BXML. The XML Schema standard ([6] and [7]) is used to define the exact syntax of the BXML elements.
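Before turning to the XML form in detail, the version prefix described in Section 2.3 can be illustrated with a short Python sketch that inspects the first five bytes of a STREAM. It is an illustration only and not part of the protocol definition: the function name basestream_version and the choice to reject a version byte of 0 are assumptions made for this example. The four leading bytes are taken literally from the byte list in the text. Note that a big-endian INT4 holding the value 256001 would consist of the bytes 0, 3, 232, 1, so implementers may want to confirm which of the two statements is intended.

```python
# Sketch only: checks the five-byte prefix described in Sections 2.2 and 2.3.
# MAGIC holds the first four bytes exactly as they are listed in the text.
MAGIC = bytes([105, 0, 3, 56])


def basestream_version(prefix: bytes) -> int:
    """Return the BaseStream version found in the first five bytes.

    Raises ValueError if the bytes do not look like a BaseStream prefix.
    The function name is hypothetical and chosen for this example.
    """
    if len(prefix) < 5:
        raise ValueError("need at least five bytes")
    if prefix[:4] != MAGIC:
        raise ValueError("not a BaseStream: unexpected leading bytes")
    version = prefix[4]
    # The text states that versions above 127 will never be used.
    # Rejecting 0 as well is an assumption of this sketch.
    if not 1 <= version <= 127:
        raise ValueError(f"unsupported version byte: {version}")
    return version


if __name__ == "__main__":
    # An empty BaseStream1: Element0 followed by the end byte 'e' (101).
    print(basestream_version(bytes([105, 0, 3, 56, 1, 101])))
```

A reader built along these lines can reject a foreign STREAM after five bytes, before it interprets any size field.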
Any BaseStream can be converted to BXML and any BXML document can be converted to binary BaseStream format. A BaseStream converted to BXML and then back to a binary BaseStream is identical to the original BaseStream. Below is an example of a BXML document that stores information about a plot.

   256001 http://www.x.com/plot2d1/ Position vs time time (s) pos (m) 1.0 2.0 3.0 4.0 0.4 1.5 2.0 1.8

A BXML document always has a root XML element called "BaseStream". The first child of the root element is "<i>256001</i>", which corresponds to Element0 in the BaseStream. After this, the following BaseStream elements are converted to XML by the rules below. The letter X is used to denote the type character of the element.

1. Any unnamed element is stored as an XML element with the same name as the type character.

      <X>content</X>

   The syntax of the X-element content is defined by the XML Schema type called X-type. See Appendix A. No XML attributes are allowed.

2. A named element which is not a tag- or end-element is stored as an XML element with a name identical to the element name. The attribute "type" is required and the value of the attribute must be the type character of the element.

      <elementName type="X">content</elementName>

   No attributes other than the type attribute are allowed. The exact syntax of the X-element content is defined by the XML Schema rule called "named-X-type". See Appendix A.

3. If the element is a tag-element, an unclosed XML start tag is added to the XML document. The name of the XML element is the string value of the tag-element. No XML attributes are allowed.

4. If the element is an end-element, the most recent unclosed XML start tag is closed by a corresponding end tag.

The exact syntax of the character content of the XML elements is defined in the XML Schema document in Appendix A. For simple types the corresponding XML character contents are represented as a decimal number.
BaseStream to BXML converters are encouraged to use the canonical lexical representation of these values as defined in the Schema standard [7]. Array types are written as a sequence of whitespace-separated representations of the corresponding simple type. There is one exception: the B-element is written as a whitespace-separated sequence of pairs of characters, where each pair represents a hexadecimal number between 0 and 255. The bytes in the B-element are considered unsigned in this context.

The figure below shows an XML Schema that defines the plot2d protocol for the example given above. The Schema types are left out for clarity. These types are defined in Appendix A.

4. Discussion

This section is not normative and is provided as a basis for further discussion. Much of the section will probably be removed before the document is published as an RFC. Send comments to frans@linova.com.

4.1 The use of the e-Element

The end of stream indication 'e' may seem redundant at first since STREAMs generally provide other ways to indicate the end of stream condition. However, the 'e' element is important when a BaseStream is inserted within another STREAM. The normal end of stream condition will then be raised at the end of the enclosing STREAM, not at the end of the BaseStream, and therefore the 'e' end indicator is needed.

4.2 Byte Order

The question of whether to use big-endian, little-endian or both byte orders in BaseStream is a bit tricky. Both byte orders are heavily used and neither has a large technical advantage over the other. BaseStream could have supported both orders, but since BaseStream is intended to be as simple as possible, this option was ruled out. Big-endian byte order was finally chosen since it is the traditional network byte order.
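As an illustration of what big-endian order means for the numerical types of Section 2.1, the following Python sketch packs a few values with the standard struct module, where the ">" prefix selects big-endian order. The sample values are arbitrary and chosen only for this example.

```python
import struct

# ">" selects big-endian (network) byte order in the struct module.
int2 = struct.pack(">h", 258)     # INT2: 16-bit signed integer
int4 = struct.pack(">i", -2)      # INT4: 32-bit signed, two's complement
float8 = struct.pack(">d", 1.0)   # FLOAT8: IEEE 754 double precision

print(int2.hex())    # 0102
print(int4.hex())    # fffffffe
print(float8.hex())  # 3ff0000000000000

# The same INT2 value in little-endian order has its two bytes swapped.
print(struct.pack("<h", 258).hex())  # 0201
```

On a little-endian host the conversion is a simple byte swap per value.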
Often the time it takes for a computer CPU to change the byte order of data is lower than the time to read or write data from a network or a file. Thus, in most cases the byte order is not very important.

4.3 Character Encoding

When storing strings, the Unicode character set should be used to be able to represent all the widely used characters in the world. How to store the Unicode characters as bytes in a stream is however a less obvious decision. RFC 2277 (IETF Policy on Character Sets and Languages) states that "protocols MUST be able to use the UTF-8 charset", which is defined in RFC 2279 [4]. BaseStream supports storing characters in this format. No other character encoding is supported, to keep the protocol as simple as possible.

4.4 Element Names

The element names are heavily restricted. The reason for this is that these names are intended to be used as part of the protocol, not the actual data. It is essential that these names can be printed on as many systems as possible. Also, the length of element names is restricted so they can easily be handled in memory.

Another reason for the restrictions on element names is that the names are intended to be usable as variable names in source code. This makes it easier to automatically generate source code for readers and writers for a specific BaseStream application.

4.5 BXML B-element Representation

How should an array of bytes be written as a string? The choice is to write it as a sequence of pairs of hexadecimal characters. Each pair represents an unsigned byte value. The bytes could also have been represented by a sequence of signed bytes with decimal notation to follow the representation for the other integer array types. The author of this document believes that the use of unsigned hexadecimal notation increases the readability and also makes the representation more compact.
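The B-element representation discussed above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the use of lowercase hexadecimal digits and a single space as separator are assumptions, since the text only requires whitespace-separated pairs of hexadecimal characters.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def b_array_to_bxml(values):
    """Render INT1 values (-128..127) as space-separated two-digit hex pairs."""
    parts = []
    for v in values:
        if not -128 <= v <= 127:
            raise ValueError(f"not an INT1: {v}")
        # Masking with 0xFF reinterprets the signed byte as unsigned (0..255).
        parts.append(f"{v & 0xFF:02x}")
    return " ".join(parts)


def bxml_to_b_array(text):
    """Parse whitespace-separated hex pairs back into signed INT1 values."""
    values = []
    for pair in text.split():
        if len(pair) != 2 or any(c not in HEX_DIGITS for c in pair):
            raise ValueError(f"expected two hex digits, got {pair!r}")
        unsigned = int(pair, 16)
        # Values 128..255 map back to the negative INT1 range.
        values.append(unsigned - 256 if unsigned >= 128 else unsigned)
    return values


if __name__ == "__main__":
    data = [0, 1, -1, 127, -128]
    text = b_array_to_bxml(data)
    print(text)                    # 00 01 ff 7f 80
    print(bxml_to_b_array(text))   # [0, 1, -1, 127, -128]
```

Each byte always occupies exactly two characters in this form, whereas signed decimal notation needs up to four ("-128").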
5. Security Considerations

The BaseStream protocol itself raises no security considerations, but a badly implemented BaseStreamReader may. BaseStream applications may limit the size of arrays and string elements so that they can easily be handled in memory by the anticipated BaseStreamReaders. However, a BaseStreamReader should be able to handle sizes up to 2^63 gracefully.

A simple implementation of a BaseStreamReader may always allocate new memory for the next element to read. If the size of an element is too large, this can result in an out-of-memory error that possibly crashes the reader process. An attacker could write any value for the size without necessarily transmitting the data that should follow, and thereby possibly crash a badly implemented BaseStreamReader.

Normative References

[1] World Wide Web Consortium, "Extensible Markup Language (XML) 1.0 (Second Edition)", W3C XML, October 2000.

[2] IEEE, "Standard for Binary Floating-Point Arithmetic, Standard No.: 754-1985", 1985.

[3] Crocker, D., "Augmented BNF for Syntax Specifications: ABNF", RFC 2234, November 1997.

[4] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC 2279, January 1998.

[5] The Unicode Consortium, "The Unicode Standard, Version 4.0.0, defined by: The Unicode Standard, Version 4.0 (Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1)", 2003.

[6] World Wide Web Consortium, "XML Schema Part 1: Structures, W3C Recommendation 2 May 2001", May 2001.

[7] World Wide Web Consortium, "XML Schema Part 2: Datatypes, W3C Recommendation 02 May 2001", May 2001.

[8] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
Informative References

[9] ANSI, "Coded Character Set--7-Bit American Standard Code for Information Interchange, ANSI X3.4-1986", 1986.

Author's Address

Frans Lundberg
Linova

Phone: +46 70 7601861
EMail: frans@linova.com
URI:   http://www.linova.com

Appendix A. XML Schema for BXML types

Intellectual Property Statement

The IETF takes no position regarding the validity or scope of any intellectual property or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; neither does it represent that it has made any effort to identify any such rights. Information on the IETF's procedures with respect to rights in standards-track and standards-related documentation can be found in BCP-11. Copies of claims of rights made available for publication and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementors or users of this specification can be obtained from the IETF Secretariat.
The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights which may cover technology that may be required to practice this standard. Please address the information to the IETF Executive Director.

Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assignees.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.