INTERNET-DRAFT
Expiration date: October 2002.

Network Working Group                                          C. Kostin
Request for Comments: nnnn
Category: Standards Track                               Date: April 2002

             Character Properties For Character References
    (Suggested file-name: draft-kostin-character-properties-00.txt)

Status of this Memo

         This document is an Internet-Draft and is subject to
         all provisions of Section 10 of RFC 2026 (even though
         that section is so long that I have probably not yet
         read it in its entirety).

     Internet-Drafts are working documents of the Internet Engineering
     Task Force (IETF), its areas, and its working groups.  Note that
     other groups may also distribute working documents as
     Internet-Drafts.

     Internet-Drafts are draft documents valid for a maximum of six
     months and may be updated, replaced, or obsoleted by other
     documents at any time.  It is inappropriate to use Internet-
     Drafts as reference material or to cite them other than as
     "work in progress."

     The list of current Internet-Drafts can be accessed at
     http://www.ietf.org/1id-abstracts.html

     The list of Internet-Draft Shadow Directories can be accessed at
     http://www.ietf.org/shadow.html

________________________________________________________________________
Copyright Notice

     Copyright (C) The Internet Society (2002).  All Rights Reserved.

________________________________________________________________________
Abstract

     (Backus-Naur Form (BNF) is used hereinafter.)
     This document describes an enhancement of the symbolic-name form of 
character references in HTML: an optional character property is added to 
the character's symbolic name in the following manner: BNF: "&<character 
symbolic name>[<separator><character property>];".


Kostin                        Standards Track                   [Page 1]

RFC NNNN    Character Properties For Character References     April 2002


________________________________________________________________________
Introduction

     Dear Sirs,

     It seems likely that the methods of encoding characters in operating 
systems are becoming more and more similar to the character entity 
references implemented in HTML by the notation: BNF: "&<character 
symbolic name>;" (see ref. [1] below) (a notation that probably 
originated in SGML (for SGML, see ref. [3] below)). This process is 
especially apparent in UTF-8 sequences. At the time of writing, the 
symbolic name has not yet become a method of encoding in operating 
systems. Probably, the way to implementing this method inside operating 
systems leads through a prior trial in HTML.

     This document describes an enhancement of the character reference 
method (the method by which characters are encoded in HTML (see ref. [1] 
below)) that supplies a character reference with an optional character 
property. The character property carries additional information about 
the character specified by the character reference: information that 
applies to that single character rather than to the string of text 
containing it. In particular, this means that the value of a character 
property can differ for just one character of a word (of a sentence or 
a paragraph), and therefore (but not only for this reason) setting this 
information by a character property is more convenient than by other 
means such as, for example, HTML-tags. Another reason to set some 
information by a character property is the need to keep that information 
in a text fragment as short as a single word, i. e. to make it possible 
to handle this information without parsing the whole document.




________________________________________________________________________
CHAPTER 1 - NOTATIONAL CONVENTIONS


(001.001.001.0001 - Chapter 1, division 1, section 1, paragraph 1)

     Backus-Naur Form (BNF) notation is used in this document.
<...> - means that enclosed expression is a definition of some element.
[...] - means that enclosed expression is an optional element.
{...} - means that enclosed expression is repeated 0 (zero) or more
        times.
<ABL> - defines left angle bracket i. e. "<".
<ABR> - defines right angle bracket i. e. ">".
<SBL> - defines left square bracket i. e. "[".
<SBR> - defines right square bracket i. e. "]".
<braceL> - defines left brace i. e. "{".
<braceR> - defines right brace i. e. "}".




________________________________________________________________________
CHAPTER 2 - CHARACTER PROPERTIES FOR CHARACTER REFERENCES


                          CHAPTER 2, DIVISION 1
                        CHARACTER PROPERTY SYNTAX


(002.001.001.0001 - Chapter 2, division 1, section 1, paragraph 1)

     The character property is added after the specified symbolic name 
and is separated from the symbolic name (or from another character 
property) by a special character, the property separator.

(002.001.001.0002 - Chapter 2, division 1, section 1, paragraph 2)

     NOTE: The hyphen-minus ("-") sign is reserved for use inside the 
character property, for example to separate the subtags of a language 
tag as in RFC 3066 (see ref. [5] below), and also for use inside 
symbolic names (see also 002.001.001.0004).

(002.001.001.0003 - Chapter 2, division 1, section 1, paragraph 3)

     BNF: &<symbolic name>{<separator><character property>};
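
     EXAMPLE: The syntax above can be sketched as a small parser. This 
is only an illustration under stated assumptions, not a normative 
implementation: the comma is assumed to be the only separator (this 
document names it only as the language property separator), symbolic 
names are assumed to consist of ASCII letters, digits and "-", and the 
name parse_reference is hypothetical.

```python
import re

# Assumption: "," is the only separator; the draft names it only as the
# language property separator and defines no other separators.
SEPARATOR = ","

# Assumption: symbolic names use ASCII letters, digits and "-".
_NAME = r"[A-Za-z0-9\-]+"
# Characters listed by the draft as allowable in a character property.
_PROPERTY = r"[A-Za-z0-9\-_']+"

_REFERENCE = re.compile(
    "&(?P<name>" + _NAME + ")"
    "(?P<props>(?:" + re.escape(SEPARATOR) + _PROPERTY + ")*);"
)


def parse_reference(text):
    """Return (symbolic name, [properties]), or None if text does not match."""
    match = _REFERENCE.fullmatch(text)
    if match is None:
        return None
    props = match.group("props")
    # props looks like ",fr-CA,x"; drop the empty piece before the first ",".
    properties = props.split(SEPARATOR)[1:] if props else []
    return match.group("name"), properties
```

     Under these assumptions, parse_reference("&eacute,fr;") yields the 
symbolic name "eacute" with the single property "fr", and a reference 
without properties, such as "&eacute;", yields an empty property list.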

(002.001.001.0004 - Chapter 2, division 1, section 1, paragraph 4)

     The character property is always an optional parameter; otherwise 
it is no longer a property of a character but a part of the symbolic 
name of the character reference.

(002.001.001.0005 - Chapter 2, division 1, section 1, paragraph 5)

     A value of the character property applies to the specified 
character only and has a higher priority than values set by other 
elements such as, for example, HTML-tags.
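
     EXAMPLE: The priority rule can be sketched as follows. This is a 
minimal illustration only; the function name effective_values and the 
representation of characters as (character, property value) pairs are 
assumptions made for the example and are not defined by this document.

```python
def effective_values(characters, inherited):
    """Pair each character with the value that applies to it.

    characters: list of (character, property value or None) tuples.
    inherited:  value set for the whole fragment, e.g. by an HTML-tag.
    """
    return [
        # A character property, when present, overrides the inherited value
        # for that one character only.
        (char, prop if prop is not None else inherited)
        for char, prop in characters
    ]
```

     For a fragment whose inherited value is "en", a character carrying 
the property "fr" is reported with "fr", while its neighbours keep "en".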

(002.001.001.0006 - Chapter 2, division 1, section 1, paragraph 6)

     NOTE: It is not recommended to override default values by using the 
character property, because this makes it difficult to search the web 
for characters with a specific character property value and because of 
possible conflicts with values of another origin, for example with 
values set by HTML-tags. See also the Introduction.




002.001.002 - CHARACTERS USED TO COMPOSE THE CHARACTER PROPERTY ITSELF


(002.001.002.0001 - Chapter 2, division 1, section 2, paragraph 1)

     The character property is composed of ASCII characters, which form 
the original, thoroughly tested set of characters for the Internet so 
far (for ASCII, see ref. [4] below), excluding:
1. (excluding) ASCII-characters used as character property separators
   (to avoid confusion with the beginning of another character 
   property) (e. g. a comma (",") is a language property separator),
2. (excluding) the "&" ASCII-character
   (used as delimiter of the beginning of another character reference 
   (ref. [1] below)),
3. (excluding) the ";" ASCII-character
   (used as delimiter of the end of the character reference (ref. [1] 
   below)).

(002.001.002.0002 - Chapter 2, division 1, section 2, paragraph 2)

     So the list of characters that are surely allowable for now for 
composing character properties is as follows:
     A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
     a b c d e f g h i j k l m n o p q r s t u v w x y z
     0 1 2 3 4 5 6 7 8 9 -
     and also the following characters, just to make sure (they are 
ASCII-characters too):
     _ (low line) ' (apostrophe)
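
     EXAMPLE: A check of a candidate character property against the 
list above might look like the sketch below. The function name 
is_valid_property is hypothetical, and the default separator set (a 
comma only) is an assumption; the rule that a property must be non-empty 
is also an assumption of this example.

```python
import string

# Characters the draft lists as surely allowable in a character property.
ALLOWED = frozenset(string.ascii_letters + string.digits + "-_'")


def is_valid_property(value, separators=","):
    """True if value is non-empty and uses only allowed characters."""
    # Separators, "&" and ";" are excluded explicitly, even though they are
    # already outside ALLOWED, to mirror the three exclusions in the text.
    forbidden = set(separators) | {"&", ";"}
    return bool(value) and all(
        ch in ALLOWED and ch not in forbidden for ch in value
    )
```

     With these assumptions, "fr-CA" is accepted, while "fr;CA" and 
"a,b" are rejected because they contain a delimiter or a separator.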

(002.001.002.0003 - Chapter 2, division 1, section 2, paragraph 3)

     SOAP NOTE: Some ASCII-characters have a special meaning in CGI form 
data sets. They are: the "%" ASCII-character (indicates the beginning of 
a character triplet notation) and the "+" ASCII-character (replaces a 
space character in a form data set). (See ref. [6] below.) Also, there 
is the character "#", which is used in the World Wide Web and in other 
systems to separate a URL from a fragment/anchor identifier. Such 
characters are often called unsafe. But it does not matter that these 
characters could appear in a character reference, because they may be 
encoded by a character triplet: BNF: "%<two hexadecimal digits code of 
ASCII-character>". E. g. a plus ("+") sign may be represented by "%2B". 
Unlike UNICODE-characters in general, for any ASCII-character two 
hexadecimal digits are completely enough to represent its code.
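
     EXAMPLE: The character triplet encoding can be sketched as below. 
The set of characters treated as unsafe ("%", "+" and "#") is taken from 
the note above; the function names are hypothetical. Decoding uses 
unquote from the Python standard library module urllib.parse, which 
converts "%XX" triplets back and leaves "+" untouched.

```python
from urllib.parse import unquote

# Characters the note above calls unsafe. "%" itself must be encoded so
# that decoding stays unambiguous.
UNSAFE = "%+#"


def encode_unsafe(text):
    """Replace each unsafe ASCII character by its %XX triplet."""
    return "".join(
        "%{:02X}".format(ord(ch)) if ch in UNSAFE else ch
        for ch in text
    )


def decode_triplets(text):
    """Turn %XX triplets back into the characters they stand for."""
    return unquote(text)
```

     Since every ASCII code is at most 7F hexadecimal, the two-digit 
format never overflows; for instance, encode_unsafe("a+b") gives 
"a%2Bb", and decode_triplets("a%2Bb") gives "a+b" again.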

(002.001.002.0004 - Chapter 2, division 1, section 2, paragraph 4)

     Character properties are case-sensitive, syntactically, because XML 
is case-sensitive.




Security Considerations

     This document raises no security issues.


Informative References

[1]  HTML 4.01 Specification. 5.3.2 Character Entity References.
     - http://www.w3.org/TR/html4/charset.html#h-5.3.2 .

[2]  The Unicode Standard, (Version 3.0 - ISBN 0-201-61633-5) 
     - http://www.unicode.org/unicode/uni2book/u2.html .

[3]  ISO/IEC 8879 - ISO (International Organization for 
     Standardization). ISO/IEC 8879-1986 (E). Information processing -
     Text and Office Systems - Standard Generalized Markup Language 
     (SGML). First edition - 1986-10-15. [Geneva]: International 
     Organization for Standardization, 1986. 

[4]  Information Systems. Coded Character Sets. 7-Bit American National
     Standard Code for Information Interchange (7-Bit ASCII). - ANSI 
     X3.4-1986. 
     - ($$) http://webstore.ansi.org/ansidocstore/product.asp?sku=ANSI+
       X3%2E4%2D1986+%28R1997%29 .

[5]  Tags for the Identification of Languages. - RFC 3066 (obsoletes RFC 
     1766). H. Alvestrand. Cisco Systems, January 2001.
     - http://www.ietf.org/rfc/rfc3066.txt

[6]  HTML 4.01 Specification. 17.13.4 Form Content Types.
     - http://www.w3.org/TR/html4/interact/forms.html#h-17.13.4 .


   Author's Address

     Cyril Kostin, 
     Ural'skaja ul., 1, 118.
     107241  Moscow, Russia     
     Voice: (+7 095) 462-3260 (It is in Moscow.)
     E-mail: cyril@chat.ru,
             cyril2@mail.ru, 
             cyril@aha.ru



________________________________________________________________________
Full Copyright Statement

      Copyright (C) The Internet Society (2002).  All Rights Reserved.

      This document and translations of it may be copied and furnished
      to others, and derivative works that comment on or otherwise
      explain it or assist in its implementation may be prepared,
      copied, published and distributed, in whole or in part, without
      restriction of any kind, provided that the above copyright notice
      and this paragraph are included on all such copies and derivative
      works.  However, this document itself may not be modified in any
      way, such as by removing the copyright notice or references to the
      Internet Society or other Internet organizations, except as needed
      for the purpose of developing Internet standards in which case the
      procedures for copyrights defined in the Internet Standards
      process must be followed, or as required to translate it into
      languages other than English.

      The limited permissions granted above are perpetual and will not
      be revoked by the Internet Society or its successors or assigns.

      This document and the information contained herein is provided on
      an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET
      ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR
      IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
      THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
      WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

