Internet DRAFT - draft-klensin-encoded-word-type-u

draft-klensin-encoded-word-type-u






Network Working Group                                         J. Klensin
Internet-Draft                                         November 24, 2011
Updates: 2047, 2231 (if approved)
Expires: May 27, 2012


              The "U" Encoding for Encoded-Words in Email
                 draft-klensin-encoded-word-type-u-00

Abstract

   The "Encoded Word" conventions have been used extensively in email
   headers and elsewhere to permit the encoding of non-ASCII characters
   where only ASCII ones are normally permitted.  The existing
   specification defines only two kinds of encoding, one of which cannot
   be understood easily by people and the other of which has been widely
   discredited.  This document specifies a third encoding that is easily
   accessible by users and much more closely tied to contemporary
   practices.

   The current version of the proposal is intended for possible
   discussion in the EAI, IRI, and PRECIS WGs to see if it sheds light
   on other issues being discussed in those WGs.  It is not, at this
   point, proposed for adoption.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on May 27, 2012.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal



Klensin                   Expires May 27, 2012                  [Page 1]

Internet-Draft          Encoded-Words: U Encoding          November 2011


   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Table of Contents

   1.  Introduction  . . . . . . . . . . . . . . . . . . . . . . . . . 3
     1.1.  Updated Specifications  . . . . . . . . . . . . . . . . . . 3
     1.2.  Terminology . . . . . . . . . . . . . . . . . . . . . . . . 4
     1.3.  Scope and Discussion List . . . . . . . . . . . . . . . . . 4
   2.  Specification . . . . . . . . . . . . . . . . . . . . . . . . . 4
   3.  Security Considerations . . . . . . . . . . . . . . . . . . . . 4
   4.  IANA Considerations . . . . . . . . . . . . . . . . . . . . . . 5
   5.  References  . . . . . . . . . . . . . . . . . . . . . . . . . . 5
     5.1.  Normative References  . . . . . . . . . . . . . . . . . . . 5
     5.2.  Informative References  . . . . . . . . . . . . . . . . . . 5
   Author's Address  . . . . . . . . . . . . . . . . . . . . . . . . . 6




























Klensin                   Expires May 27, 2012                  [Page 2]

Internet-Draft          Encoded-Words: U Encoding          November 2011


1.  Introduction

   The "Encoded Word" conventions [RFC2047] have been used extensively
   in email headers and elsewhere to permit the encoding of non-ASCII
   characters where only ASCII ones are normally permitted.  That
   existing encoded-word specification defines only two kinds of
   encoding, one of which cannot be understood easily by people ("B",
   the MIME "Base64" encoding) and the other of which ("Q", so-called
   Quoted Printable) has been widely discredited.  This document
   specifies a third encoding, based on the "\u'NNNN'" convention, that
   is easily accessible by users and much more closely tied to
   contemporary practices.

   Unlike the "B" and "Q" encodings, which were specified at a time when
   many coded character sets were in common use, it is now appropriate
   [RFC5198] to tie a new encoding specifically to Unicode [Unicode] and
   the corresponding ISO Standard [ISO10646], viewing conversion to
   local character sets, if necessary at all, to be a local matter.
   Consequently, this specification permits only the combination "=?iso-
   10646-UCS-4?u?".

   [[anchor2: Note in Draft: If we were really going to do this, it
   would make sense to define a charset that would actually reflect
   Unicode code points, not some encoding of them.  Neither of the
   currently-registered "iso-10646-UCS-4" nor "UTF-32" and its
   variations are quite right for that purpose.  Cf.
   http://www.iana.org/assignments/character-sets]]

   If adopted, it is intended not only as an alternative to "Q" and "B",
   but also as an alternative to the %-encoding of Section 2.1 of the
   URI Specification [RFC3986] of UTF-8 [RFC3629] (and other) strings.
   %-encoding was more than adequate for its original purpose of
   encoding eight-bit character sets, notably ISO 8859-1 [ISO8859-1],
   but is problematic for email (especially addresses and fields related
   to them) because "%" has an important historic (and still
   occasionally used) meaning in those contexts and because its use to
   encode already-encoded forms of multi-octet character sets, such as
   UTF-8 and Unicode, creates strings that are at least as difficult for
   end users to interpret as Base64.

1.1.  Updated Specifications

   This document, if approved, updates the Encoded-Word specification
   [RFC2047] and the specification for the use encoded-words with
   language information [RFC2231] to permit use of an additional
   encoding type, type "U".





Klensin                   Expires May 27, 2012                  [Page 3]

Internet-Draft          Encoded-Words: U Encoding          November 2011


1.2.  Terminology

   Some reasonable understanding of Encoded-Words and the Quoted-
   Printable, Base64, and %-encoding conventions are required to
   understand this introductory material but not the proposal itself.

   The key words "MUST", "MUST NOT", "SHOULD", "SHOULD NOT", and "MAY"
   in this document are to be interpreted as defined in RFC 2119
   [RFC2119].

1.3.  Scope and Discussion List

   RFC Editor: In the unlikely event that you see this subsection, it
   should be removed before publication.

   The current version of the proposal is intended for possible
   discussion in the EAI, IRI, and PRECIS WGs to see if it sheds light
   on other issues being discussed in those WGs.  If discussions are of
   interest, they should occur on the mailing lists associated with
   those groups.

   This Internet Draft is, at this point, intended only to promote
   discussion of a possibly-useful building block for other work.  It is
   not proposed for adoption by the IETF for any purpose.


2.  Specification

   A new encoding form for encoded words is defined with code "u".  The
   associated encoded-text string is consistent with the rules in
   Section 4 of RFC 2047, i.e., it consists of ASCII characters with
   space, tab, and "?" characters excluded.  Non-ASCII characters are
   encoded using the \u'NNNN' form, where "NNNN" consists of four to six
   hexadecimal digits designating a Unicode (ISO 10646) code point.
   That encoding convention is defined in RFC 5137 [RFC5137] together
   with an explanation of why the quotes should be required.

   As an example, the German equivalent of the string "This is nuts",
   would appear in the extended form of RFC 2231 (updated by verified
   Erratum 478 [RFC2231-Err478]) as
   =?iso-10646-UCS-4+de?u?Das ist verr\u'00FC'ckt?=


3.  Security Considerations

   This specification does not raise any security issues that are not
   already present in RFC 2047 and its various updates.  Because the
   coding is more transparent to the end user than any of Base64, Quoted



Klensin                   Expires May 27, 2012                  [Page 4]

Internet-Draft          Encoded-Words: U Encoding          November 2011


   Printable for non-ASCII text, or %-encoding of UTF-8, it may
   eliminate or reduce one possible attack vector that is present with
   those other approaches.


4.  IANA Considerations

   [[anchor9: RFC Editor: Please remove this section.]]
   Because there does not appear to be a registry for either encoded-
   word encodings or the content-transfer-encodings on which they are
   based, this document requires no actions by the IANA.


5.  References

5.1.  Normative References

   [RFC2047]  Moore, K., "MIME (Multipurpose Internet Mail Extensions)
              Part Three: Message Header Extensions for Non-ASCII Text",
              RFC 2047, November 1996.

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2231]  Freed, N. and K. Moore, "MIME Parameter Value and Encoded
              Word Extensions:
              Character Sets, Languages, and Continuations", RFC 2231,
              November 1997.

   [RFC2231-Err478]
              Stedfast, J., "MIME Parameter Value and Encoded Word
              Extensions: Character Sets, Languages, and Continuations,
              Erratum 478", November 2001,
              <http://www.rfc-editor.org./errata_search.php?eid=478>.

   [Unicode]  The Unicode Consortium.  The Unicode Standard, Version
              6.0.0, defined by:, "The Unicode Standard, Version 6.0.0",
              Mountain View, CA: The Unicode Consortium, 2011. ISBN 978-
              1-936213-01-6, 2011,
              <http://www.unicode.org/versions/Unicode6.0.0/>.

5.2.  Informative References

   [ISO10646]
              International Organization for Standardization,
              "Information Technology - Universal Multiple-octet coded
              Character Set (UCS)", ISO Standard 10646:2011, March 2011.




Klensin                   Expires May 27, 2012                  [Page 5]

Internet-Draft          Encoded-Words: U Encoding          November 2011


   [ISO8859-1]
              International Organization for Standardization,
              "Information technology - 8-bit single byte coded graphic
              - character sets - Part 1: Latin alphabet No. 1",
              ISO Standard 8859-1:1998, 1998.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

   [RFC3986]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifier (URI): Generic Syntax", STD 66,
              RFC 3986, January 2005.

   [RFC5137]  Klensin, J., "ASCII Escaping of Unicode Characters",
              BCP 137, RFC 5137, February 2008.

   [RFC5198]  Klensin, J. and M. Padlipsky, "Unicode Format for Network
              Interchange", RFC 5198, March 2008.


Author's Address

   John C Klensin
   1770 Massachusetts Ave, #322
   Cambridge, MA  02140
   USA

   Phone: +1 617 491 5735
   Email: john-ietf@jck.com






















Klensin                   Expires May 27, 2012                  [Page 6]