Internet  Draft                                     Andre Noel Tremblay
Document: draft-tremblay-bom-00.txt
Category: Experimental
Expires:  November 28 2003                                  May 28 2003
    
                  Standardized Byte Order Mark (BOM)
            for plain text document format recognition
    
Status of this Memo 

     This document is an Internet-Draft and is subject to
     all provisions of Section 10 of RFC2026.

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as
     Internet-Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-
     Drafts as reference material or to cite them other than as
     "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/1id-abstracts.html

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html
    
    
Abstract 

   This document proposes a way to standardize the Byte Order Mark (BOM)
   for plain text documents that requires few changes to current
   protocols and software, by reusing already established protocols.
   It also proposes a way to include information useful for processing
   documents.

Table of Contents

1- Introduction.

  Three variables make decoding plain text documents problematic:

    1- the code page, which can differ from one document
       to another;

    2- the end-of-line convention, which can be CR-LF, CR or LF,
       mainly depending on the operating system;

    3- for multibyte encodings,
       the big-endian or little-endian byte order,
       mainly depending on the computer's CPU.

  The use of a mark at the beginning of a file containing plain
  text, to identify the encoding of its characters, is commonly
  referred to as the Byte Order Mark, or BOM for short.

  This follows the practice of many file formats, which place
  a signature at the beginning of the file.

  It is assumed that, to be useful, this mark needs to use values
  that are not part of the encoding of the text, as is the case with
  the UTF encodings, where the BOM values come from unused code points.

  Adding this mark creates problems, because current software that is
  unaware of it will not recognize it correctly, and because merging
  files will include each BOM as part of the merged file.
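
  The merging problem can be reproduced with the existing UTF-8 BOM.
  The following is a minimal sketch in Python; the file contents are
  invented for illustration. The "utf-8-sig" codec strips only a
  leading BOM, so the BOM of the second file survives as a stray
  U+FEFF character in the middle of the merged text.

```python
import codecs

# Two hypothetical files, each starting with the UTF-8 BOM (EF BB BF).
first = codecs.BOM_UTF8 + "abc\n".encode("utf-8")
second = codecs.BOM_UTF8 + "def\n".encode("utf-8")

# Merging by plain concatenation, as a tool unaware of the BOM would do.
merged = first + second

# "utf-8-sig" removes the BOM only at the very start of the data.
text = merged.decode("utf-8-sig")
print(repr(text))  # 'abc\n\ufeffdef\n'
```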

  These problems are avoidable, because the encodings derived from
  ASCII already provide a protocol for storing the needed
  information without causing them.

2- The use of escape sequence as BOM.

   The escape sequence (or "escape code") is a string of codes starting
   with the character (Esc) ASCII 27 (0x1B) (coded 01/11).

   The protocol for escape sequences defined by ISO/IEC 2022 includes
   a code page switching function, whose values are defined by the
   ISO/IEC International Register of Coded Character Sets To Be Used
   With Escape Sequences for information interchange in data processing.

   An escape sequence consists of two or more bytes, following
   the ISO/IEC 2022 format: Esc I F

       * Esc: the first byte is the ESCAPE character (Esc) (coded 01/11).

       * F:   the last byte, known as the Final Byte,
              is from columns 03 to 07 of the code table
              (excluding the DELETE character, 07/15).

       * I:   any bytes between the ESCAPE and the Final Byte
              are known as Intermediate Bytes
              and are from column 02 of the code table.
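
   The Esc I F structure described above can be recognized with a few
   range checks. The following Python sketch is only an illustration;
   the function name and its return shape are assumptions of this
   example, not part of ISO/IEC 2022 or of this proposal.

```python
ESC = 0x1B  # ESCAPE, coded 01/11


def parse_escape_sequence(data: bytes, start: int = 0):
    """Parse one escape sequence (Esc I... F) beginning at data[start].

    Returns (intermediate_bytes, final_byte, next_index), or None when
    no well-formed sequence starts at that position.
    """
    if start >= len(data) or data[start] != ESC:
        return None
    pos = start + 1
    # Intermediate Bytes: column 02 of the code table (0x20..0x2F).
    while pos < len(data) and 0x20 <= data[pos] <= 0x2F:
        pos += 1
    # Final Byte: columns 03 to 07, excluding DELETE (0x30..0x7E).
    if pos < len(data) and 0x30 <= data[pos] <= 0x7E:
        return data[start + 1:pos], data[pos], pos + 1
    return None
```

   For example, parse_escape_sequence(b"\x1b-Ahello") yields one
   Intermediate Byte (02/13), the Final Byte 04/01, and the index 3
   at which the text proper begins.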

   Placing an escape sequence at the beginning of a file, to switch to
   the proper code page of the document, solves the problem of
   identifying the code page of a document. It does so without
   requiring changes to software, even when documents with different
   code pages are merged.

   Software that already complies with ISO/IEC 2022 has already
   reached this level of implementation for single-byte encodings.
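
   As an illustration of why merged documents remain decodable, the
   Python sketch below switches codec each time it meets an escape
   sequence. The CODE_PAGES table is a hypothetical two-entry subset
   chosen for this example and should be checked against the
   International Register before any real use; the function name is
   likewise an assumption.

```python
import re

# Hypothetical subset of escape sequences mapped to Python codec names.
# These entries are assumptions of this sketch, for illustration only.
CODE_PAGES = {
    b"\x1b-A": "iso8859-1",  # Esc 02/13 04/01
    b"\x1b-F": "iso8859-7",  # Esc 02/13 04/06
}

# Esc, any number of Intermediate Bytes (column 02), one Final Byte.
_SEQUENCE = re.compile(rb"\x1b[\x20-\x2f]*[\x30-\x7e]")


def decode_marked(data: bytes, default: str = "ascii") -> str:
    """Decode text in which escape sequences switch the code page."""
    pieces = []
    codec = default
    pos = 0
    for match in _SEQUENCE.finditer(data):
        # Decode the text that precedes this escape sequence.
        pieces.append(data[pos:match.start()].decode(codec))
        # Unknown sequences are skipped and leave the codec unchanged.
        codec = CODE_PAGES.get(match.group(), codec)
        pos = match.end()
    pieces.append(data[pos:].decode(codec))
    return "".join(pieces)


first = b"\x1b-A" + "caf\u00e9 ".encode("iso8859-1")
second = b"\x1b-F" + "\u03b1\u03b2\u03b3".encode("iso8859-7")
merged_text = decode_marked(first + second)
```

   Because each file carries its own sequence, plain concatenation
   of the two files still decodes each part with its own code page.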


3- Multibyte considerations.

   For single-byte encodings, no changes are needed.

   When the encoding is multibyte, the escape code takes
   different byte sequences in the UCS-2 and UCS-4 forms:

     00 1B  00 1B  2 bytes, big-endian
     1B 00  1B 00  2 bytes, little-endian
     00 00  00 1B  4 bytes, big-endian on bytes and words
     00 00  1B 00  4 bytes, little-endian on bytes, big-endian on words
     00 1B  00 00  4 bytes, big-endian on bytes, little-endian on words
     1B 00  00 00  4 bytes, little-endian on bytes and words
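
   A reader could use the six byte patterns listed above to guess the
   code unit width and byte order from the first four bytes of a file.
   The Python sketch below is one possible way to do so. The order
   labels ("12", "21", "1234", "2143", "3412", "4321", where 1 is the
   most significant byte) and the function name are this example's
   own notation, not part of the proposal.

```python
# First four bytes of the file -> (bytes per code unit, byte order).
ESC_FORMS = {
    b"\x00\x1b\x00\x1b": (2, "12"),
    b"\x1b\x00\x1b\x00": (2, "21"),
    b"\x00\x00\x00\x1b": (4, "1234"),
    b"\x00\x00\x1b\x00": (4, "2143"),
    b"\x00\x1b\x00\x00": (4, "3412"),
    b"\x1b\x00\x00\x00": (4, "4321"),
}


def detect_escape_form(data: bytes):
    """Return (width, order) for a leading escape code, or None."""
    form = ESC_FORMS.get(data[:4])
    if form is not None:
        return form
    # A lone Esc not matching any pattern is taken as single-byte.
    if data[:1] == b"\x1b":
        return (1, "1")
    return None
```

   The six patterns are mutually distinct, so a single table lookup is
   enough and no ordering of the tests is required.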


   The escape code is doubled in the 2-byte encoding so that it can be
   distinguished from the 4-byte cases. This departs from the
   ISO/IEC 2022 rules, but without conflict, since the I bytes have to
   be from column 02 of the code table and Esc is from column 01.

   This can be used to know which case is involved. But it can work
   only if the code points 1B00 and 1B00 0000, which are unassigned
   code points, are reserved as little-endian variants.

   This can be implemented with few changes.

4- The end of the line issue.

   There is, to my knowledge, no in-text protocol defined to identify
   the end-of-line convention used inside a document.

   But the escape sequence can be extended to do so.

   Extending the escape sequence would also be a better way than
   flags taken from Unicode code points to specify other needed
   information, such as language, locale, time zone, and so on.

   The extension can be made by also using columns 00 and 01,
   which redefines the Intermediate Bytes:

       * I:   any bytes between the ESCAPE and the Final Byte
              are known as Intermediate Bytes,
              and are from columns 00 to 02 of the code table.

       With the following conventions:

              I     Meaning (I values in hexadecimal)
              --    ---------------------------------

                  * Locale tags, each followed by 4 bytes which contain
                                 an identification number for the tag.

              11    language tag
              12    country tag
              13    time zone tag
              14    calendar tag

                  * End of line coding

              0D    CARRIAGE RETURN (CR) used for end of line
              0A    LINE FEED (LF) used for end of line
              0D0A  (CR)(LF) used for end of line
              0A0D  (LF)(CR) used for end of line
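
   The conventions above leave several details open, so the Python
   sketch below is only one possible reading of them. It assumes a
   single-byte encoding, that the items may appear in any order
   before the Final Byte, and that the 4-byte identification number
   is an unsigned big-endian integer. The function name and the
   returned dictionary keys are hypothetical.

```python
import struct

ESC = 0x1B
TAG_NAMES = {0x11: "language", 0x12: "country",
             0x13: "time zone", 0x14: "calendar"}


def parse_extended_bom(data: bytes):
    """Parse a leading extended escape sequence, or return None."""
    if not data or data[0] != ESC:
        return None
    info = {"eol": b"", "tags": {}, "intermediates": b"",
            "final": None, "length": None}
    pos = 1
    while pos < len(data):
        byte = data[pos]
        if byte in TAG_NAMES:
            # Locale tag followed by 4 bytes (big-endian is assumed).
            if pos + 5 > len(data):
                return None  # truncated identification number
            (ident,) = struct.unpack(">I", data[pos + 1:pos + 5])
            info["tags"][TAG_NAMES[byte]] = ident
            pos += 5
        elif byte in (0x0D, 0x0A):
            # End-of-line coding: CR, LF, CR LF or LF CR.
            info["eol"] += bytes([byte])
            pos += 1
        elif 0x20 <= byte <= 0x2F:
            # Ordinary Intermediate Byte from column 02.
            info["intermediates"] += bytes([byte])
            pos += 1
        elif 0x30 <= byte <= 0x7E:
            # Final Byte: the sequence ends here.
            info["final"] = byte
            info["length"] = pos + 1
            return info
        else:
            return None  # byte not covered by these conventions
    return None  # no Final Byte found
```

   For instance, the header 1B 0D 0A 11 00 00 00 2A 2D 41 would be
   read as a CR LF end of line, a language tag carrying the number
   42, one Intermediate Byte 02/13 and the Final Byte 04/01.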


Security Considerations 
    
   No security issues are involved.

Author's Addresses

   Andre Noel Tremblay
   2419-6 St Dominique
   Jonquiere (Quebec)
   Canada

   G7X6L1

   Phone: (418) 542-9053
   Email: andre.n.tremblay@videotron.ca
