Internet Draft                                      Andre Noel Tremblay
Document: draft-tremblay-bom-00.txt
Category: Experimental
Expires: November 28 2003                                   May 28 2003


          Standardized Byte Order Mark (BOM) for plain text
                      document format recognition


Status of this Memo

   This document is an Internet-Draft and is subject to all provisions
   of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/1id-abstracts.html

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

Abstract

   This document proposes a way to standardize the Byte Order Mark
   (BOM) for plain text documents that requires few changes to current
   protocols and software, by using already established protocols. It
   also proposes a way to include information useful for processing
   documents.

Table of Contents

   1. Introduction
   2. The use of escape sequence as BOM
   3. The multibyte considerations
   4. The end of the line issue
   Security Considerations
   Author's Addresses

1. Introduction

   There are three variables that make decoding plain text documents
   problematic:

   1- the code pages, which can differ from one document to another;

   2- the end of line, which can be CR-LF, CR or LF, mainly depending
      on the operating system;

   3- for multibyte coding, the big-endian or little-endian sequences,
      mainly depending on the computer CPU.

   The use of a mark at the beginning of a file that contains plain
   text, to identify the coding format of the characters, is commonly
   referred to as the Byte Order Mark, or BOM for short. This repeats
   the practice of many file formats, which have a signature at the
   beginning of the file.

   It is assumed that, to be useful, this mark needs to use values that
   are not part of the coding of the text, as is the case with UTF
   coding, where the BOM values are taken from unused code points.

   Adding this mark creates problems, because current software that is
   unaware of it will not recognize it correctly, and because merging
   files includes the BOM as part of the merged file.

   These problems need not exist, because the codings derived from
   ASCII already have a protocol for storing the needed information
   without causing problems.

2. The use of escape sequence as BOM

   The escape sequence (or "escape code") is a string of codes starting
   with the ESCAPE character (Esc), ASCII 27 (0x1B) (coded 01/11).

   The protocol for escape sequences defined by ISO/IEC 2022 includes a
   code page switch function whose values are defined by the ISO/IEC
   International Register of Coded Character Sets To Be Used With
   Escape Sequences for information interchange in data processing.

   An escape sequence consists of two or more bytes, governed by the
   ISO/IEC 2022 format:

      Esc I F

   * Esc: the first byte is the ESCAPE character (Esc) (coded 01/11).

   * F: the last byte, known as the Final Byte, is from columns 03 to
     07 of the code table (excluding the DELETE character, 07/15).

   * I: any bytes between the ESCAPE and the Final Byte are known as
     Intermediate Bytes and are from column 02 of the code table.

   Having an escape sequence at the beginning of a file to switch to
   the proper code page of the document solves the problem of
   identifying the code page of a document. It does so without
   requiring changes to software, even when documents with different
   code pages are merged.

   For software that is already compliant with ISO/IEC 2022, this level
   of implementation is already achieved for single-byte coding.

3. The multibyte considerations

   For single-byte coding, no changes are needed to implement this.
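
   For the single-byte case, the Esc I F structure described in
   section 2 can be checked at the start of a file with very little
   code. The following is only a minimal sketch under stated
   assumptions: the function name parse_leading_escape is hypothetical,
   the column ranges are translated to byte values (column 02 as
   0x20-0x2F, columns 03 to 07 without DELETE as 0x30-0x7E), and the
   lookup of the recognized sequence in the International Register is
   left out.

```python
ESC = 0x1B  # ESCAPE, coded 01/11


def parse_leading_escape(data: bytes):
    """Split a leading Esc I... F sequence from the rest of the data.

    Returns (intermediate_bytes, final_byte, rest), or None when the
    data does not start with a well-formed escape sequence.
    Hypothetical helper, not part of any standard or library.
    """
    if len(data) < 2 or data[0] != ESC:
        return None
    i = 1
    # Intermediate Bytes: zero or more bytes from column 02 (0x20-0x2F).
    while i < len(data) and 0x20 <= data[i] <= 0x2F:
        i += 1
    # Final Byte: columns 03 to 07, excluding DELETE (0x30-0x7E).
    if i < len(data) and 0x30 <= data[i] <= 0x7E:
        return data[1:i], data[i], data[i + 1:]
    return None
```

   For example, parse_leading_escape(b"\x1b-Ahello") yields one
   Intermediate Byte (02/13), the Final Byte 04/01, and the remaining
   text b"hello". These bytes are used only as a well-formed example;
   a real reader would map the sequence to a code page through the
   register.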

   When the coding is multibyte, the escape code will have different
   sequences, shown here for 2-byte coding and for the 4-byte UCS-4
   form:

      00 1B 00 1B   2 bytes, big-endian
      1B 00 1B 00   2 bytes, little-endian
      00 00 00 1B   4 bytes, big-endian on bytes and words
      00 00 1B 00   4 bytes, little-endian on bytes
      00 1B 00 00   4 bytes, little-endian on words
      1B 00 00 00   4 bytes, little-endian on bytes and words

   The double escape code in 2-byte coding makes it possible to
   differentiate it from the 4-byte cases. This departs from the
   ISO/IEC 2022 rules, but without conflict, because the I bytes have
   to be from column 02 of the code table and Esc is from column 01.
   This can be used to know which case is involved.

   However, it can work only if the code points 1B00 and 1B00 0000,
   which are unassigned code points, are reserved as little-endian
   variants. This can be implemented with few changes.

4. The end of the line issue

   There is, to my knowledge, no in-text protocol defined to identify
   the end of line convention used inside a document. An extension of
   the escape sequence can provide one. Extending the escape sequence
   would also be a better way than the flags from Unicode code points
   to specify other needed information, such as language, locale, time
   zone, and so on.

   The extension uses columns 00 and 01, which redefines the
   Intermediate Bytes:

   * I: any bytes between the ESCAPE and the Final Byte are known as
     Intermediate Bytes, and are from columns 00 to 02 of the code
     table.

   With the following conventions:

      I      Is followed by
      --     --------------

      * Locale tags, followed by 4 bytes which contain an
        identification number for the flag.

      11     language tag
      12     country tag
      13     time zone tag
      14     calendar tag

      * End of line coding

      0D     CARRIAGE RETURN (CR) used for end of line
      0A     LINE FEED (LF) used for end of line
      0D0A   (CR)(LF) used for end of line
      0A0D   (LF)(CR) used for end of line

Security Considerations

   No security issues are involved.
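
Illustrative sketch for section 3

   The byte patterns listed in section 3 lend themselves to a simple
   lookup on the first four bytes of a file. The following is a minimal
   sketch, not a definitive implementation: the names PATTERNS and
   detect_code_unit are hypothetical, and the byte order labels are one
   reading of each pattern (an assumption), chosen so that every
   pattern gets a distinct description.

```python
# First four bytes of the file -> (code unit width in bytes, byte order).
# The order labels are an assumption made for this sketch.
PATTERNS = [
    (bytes.fromhex("001b001b"), 2, "big-endian"),
    (bytes.fromhex("1b001b00"), 2, "little-endian"),
    (bytes.fromhex("0000001b"), 4, "big-endian bytes and words"),
    (bytes.fromhex("00001b00"), 4, "little-endian bytes"),
    (bytes.fromhex("001b0000"), 4, "little-endian words"),
    (bytes.fromhex("1b000000"), 4, "little-endian bytes and words"),
]


def detect_code_unit(data: bytes):
    """Guess the code unit width and byte order from a leading Esc.

    Returns (width, order), or None when the data does not start with
    any recognized form of the escape code. Hypothetical helper.
    """
    head = data[:4]
    # Multibyte forms are tested first, because the little-endian
    # patterns also begin with the single byte 0x1B.
    for pattern, width, order in PATTERNS:
        if head == pattern:
            return width, order
    # A lone 0x1B not matched above is taken as single-byte coding.
    if data[:1] == b"\x1b":
        return 1, "single-byte"
    return None
```

   A reader could call detect_code_unit on the start of a file and then
   decode the rest of the escape sequence with the detected width and
   order. Data that does not begin with an escape code returns None,
   which leaves the choice of a default coding to the caller.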

Author's Addresses

   Andre Noel Tremblay
   2419-6 St Dominique
   Jonquiere (Quebec)
   Canada G7X6L1

   Phone: (418) 542-9053
   Email: andre.n.tremblay@videotron.ca