Network Working Group                                            C. Falk
Internet Draft                                         Infinite Automata
Intended status: Standard                                    May 9, 2011
Expires: November 2011


          Tags for the Identification of Transliterated Text
                draft-falk-transliteration-tags-00.txt


Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79. This document may not be modified,
   and derivative works of it may not be created, except to publish it
   as an RFC and to translate it into languages other than English.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008. The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html

   This Internet-Draft will expire on November 9, 2011.

Falk                   Expires November 9, 2011                 [Page 1]

Internet-Draft           Transliteration Tags                   May 2011


Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors. All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document. Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.

Abstract

   This document describes the structure, content, creation, and
   semantics of language tags for use in describing text that was
   transliterated from one orthographic system to another.

Table of Contents

   1. Introduction
      1.1. Problems Concerning Language Tags
      1.2. Tags for Identifying Languages
   2. Transliteration Tags
   3. Security Considerations
   4. IANA Considerations
   5. Conclusions
   6. References
      6.1. Normative References
      6.2. Informative References
   7. Acknowledgments
   Appendix A. Examples of Transliteration Tags (Informative)

1. Introduction

1.1. Problems Concerning Language Tags

   Language tags are a common tool on the Internet. Such tags are
   useful in content localization and machine translation. Many
   different standards exist for representing language information in
   machine-readable formats.

   Existing language tags share a common problem: they represent only
   the language and not the orthography used to write it.

   Many languages such as Russian, Chinese, and Arabic have multiple
   orthographies for written content. A few languages, including
   Serbian, are digraphic, which means they are natively written in two
   or more different scripts.

   A further complication arises from the practice of transliteration,
   or changing orthographies. Most often this is seen when languages
   written in non-Latin orthographies are rewritten using Latin
   characters. These orthographies are not mutually intelligible, so
   describing two pieces of text as "Chinese written in Latin script"
   is not useful if one was transliterated using the Wade-Giles system
   and the other using the Pinyin system.

   The problems a complete language tag must address are:

   1. Identify the content's language.

   2. Identify the language's current orthography.

   3. Identify the original orthography used if the content was subject
      to transliteration.

   4. Identify the system used in the transliteration, if the current
      content differs from the original.

   To date no single language tag standard can address all these
   problems.

1.2. Tags for Identifying Languages

   While there are several existing language tag standards, only a
   handful of them advance us toward the goal of a complete language
   tag system. Chief among these is RFC 5646, edited by Phillips and
   Davis. RFC 5646 satisfies the first two criteria of the proposed
   complete language tag.

   First, RFC 5646 represents the content's language. This is the very
   first portion of a BCP 47 language tag. If an alpha-2 code belonging
   to the ISO 639-1 standard is available then that code is used. If no
   alpha-2 code is available then the longer alpha-3 code belonging to
   the ISO 639-3 standard is used.

   Second, RFC 5646 represents the language's current orthography. This
   is an optional portion of the BCP 47 tag.
   Language orthography representation is handled by the alpha-4 codes
   defined in the ISO 15924 standard.

   What RFC 5646 does not address is the last two
   transliteration-related criteria for a complete language tag.

2. Transliteration Tags

   While RFC 5646 does have its shortcomings, it provides for future
   growth and expansion through extension sub-tags. By using these
   extension sub-tags we can add a second layer of analysis upon the
   existing RFC 5646 tags to satisfy our transliteration tag criteria.

   As discussed in Section 1.1, the transliteration tag needs to define
   two additional pieces of data:

   1. Original orthography.

   2. The transliteration system used.

   There will be a new extension tag for each of these pieces of data:

   1. The original source orthography will be denoted by the singleton
      "s" followed by the ISO 15924 code for the source script.

   2. The transliteration system will be denoted by the singleton "t"
      followed by a 2-8 character alphanumeric string abbreviation of
      the transliteration system.

3. Security Considerations

   The transliteration tag described in this document includes
   information about the transliteration system used. Some
   transliteration standards are proprietary, and revealing their use
   in a public exchange might constitute a breach of privacy.

4. IANA Considerations

5. Conclusions

   This document shows how the extension mechanisms built into the
   language tag standard of RFC 5646 can be used to represent written
   languages more completely, including any transliteration performed
   upon the text.

6. References

6.1. Normative References

   [1] Phillips, A. and Davis, M. (Editors), "Tags for Identifying
       Languages", BCP 47, RFC 5646, September 2009.

   [2] International Organization for Standardization, "ISO 639-1:2002.
       Codes for the representation of names of languages - Part 1:
       Alpha-2 code", July 2002.

   [3] International Organization for Standardization, "ISO 639-3:2007.
       Codes for the representation of names of languages - Part 3:
       Alpha-3 code for comprehensive coverage of languages", February
       2007.

   [4] International Organization for Standardization, "ISO 15924:2004.
       Information and documentation -- Codes for the representation of
       names of scripts", January 2004.

6.2. Informative References

   [5] Dale, I.R.H., "Digraphia", International Journal of the
       Sociology of Language 26 (1980) pp. 5-13.

   [6] Buckwalter, T., "Buckwalter Arabic Transliteration", Qamus,
       2002.

   [7] International Organization for Standardization, "ISO 9:1995.
       Transliteration of Cyrillic characters into Latin characters -
       Slavic and non-Slavic languages", 1995.

7. Acknowledgments

   Thanks to Tim Buckwalter of the University of Maryland for patiently
   answering questions about his Arabic transliteration system.

   This document was prepared using 2-Word-v2.0.template.dot.

Appendix A. Examples of Transliteration Tags (Informative)

   ar-Latn-s-Arab-t-buckwalt
      (Arabic-language text transliterated from the Arabic script into
      the Latin script via the Buckwalter transliteration system)

   ru-Latn-s-Cyrl-t-iso9
      (Russian-language text transliterated from the Cyrillic script
      into the Latin script via the ISO 9 transliteration system)

   zh-Latn-s-Hans-t-pinyin
      (Mandarin Chinese-language text transliterated from the
      simplified Han script into the Latin script via the Pinyin
      transliteration system)

Authors' Addresses

   Courtney Falk
   Infinite Automata

   Email: court@infiauto.com
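
   Informative note: the tag layout described in Section 2 and
   illustrated in Appendix A can be sketched in code. The Python below
   is a minimal, non-normative sketch and is not part of the proposal.
   The function names and the regular expression are assumptions made
   for illustration. The sketch handles only the simple shape used in
   Appendix A (language, optional script, "s" extension, "t"
   extension) and does not implement the full RFC 5646 grammar.

```python
import re

# Non-normative sketch. The names below are hypothetical; the draft
# defines only the tag layout, not an API. Only the shape
#   language[-script]-s-<source script>-t-<system>
# is recognized, not regions, variants, or other extensions.
_TAG_RE = re.compile(
    r"(?P<language>[A-Za-z]{2,3})"         # ISO 639-1 or ISO 639-3 code
    r"(?:-(?P<script>[A-Za-z]{4}))?"       # optional current script
    r"-s-(?P<source_script>[A-Za-z]{4})"   # singleton "s": source script
    r"-t-(?P<system>[A-Za-z0-9]{2,8})"     # singleton "t": system name
)


def parse_transliteration_tag(tag):
    """Split a tag into its components; raise ValueError if malformed."""
    match = _TAG_RE.fullmatch(tag)
    if match is None:
        raise ValueError("not a transliteration tag: %r" % (tag,))
    return match.groupdict()


def build_transliteration_tag(language, script, source_script, system):
    """Assemble a tag; pass script=None to omit the current script."""
    parts = [language]
    if script:
        parts.append(script)
    parts += ["s", source_script, "t", system]
    tag = "-".join(parts)
    parse_transliteration_tag(tag)  # raises ValueError if malformed
    return tag


if __name__ == "__main__":
    # The three example tags from Appendix A.
    for example in (
        "ar-Latn-s-Arab-t-buckwalt",
        "ru-Latn-s-Cyrl-t-iso9",
        "zh-Latn-s-Hans-t-pinyin",
    ):
        print(example, parse_transliteration_tag(example))
```

   Validating a built tag by re-parsing it keeps the 2-8 character
   limit on the transliteration system name in a single place.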