Expires 10/4/2001

International Language Bridge (ILB) For Implementing Language Free Services
Mark Felton
draft-felton-universal-language-00.txt

1.1 Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

1.2 Copyright Notice

Temporary Copyright (C) Mark Felton (2001). All Rights Reserved.
Future Copyright (C) The Internet Society (2001). All Rights Reserved.
Mark Felton
markf@scicom.alphacdc.com

1.3 Table Of Contents

   Abstract
   Terminology
   UL Components
   Unicode (Universal Code)
   Fixed Components (FC)
   Type Designator (TD) - Type Element (TE)
   Variable Meaning Function (VM_function)
   Language Forcing (LF)
   Linking Element (LE)
   User Defined Components (UDC)
   Basic Fixed Components
   Pictorial Representations (PR)
   Syntax versus Content
   Syntax Component
   UL Syntax
   Position Dependent Syntax (PDS)
   User Defined Syntax (UDS)
   Degree of Specificity
   Multiple Vocabularies
   Basic Translators (BT) and Filter Translators (FT)
   Some Initial Benefits of UL
   Steps in Creating UL
   Short and Long Term Goals For UL
   Fonts & Sounds
   UL Type Dictionaries (UTD)
   Host Language Interface
   Multi-lingual Vocabulary Issues
   Unicode Parsing
   UL to HL Parsing
   HL to UL Parsing
   Base Object Definitions
   UML Analysis
   A Simple Hello Example

1.4 Abstract

Differences of language and culture create an enormous blockage for the World Wide Web. While we communicate well internationally using pictures, we have limited communications when we attempt to build semantic bridges. In general, our approach to this has been to try to force the other guy to learn our language. This has had only limited success. Many people, quite rightly, resent a vision of American English as the only true universal language. In many ways we find science fiction images, like the Star Trek universal language decoder, much more appealing. This is because we know intuitively that language is a central part of culture and should be maintained, not destroyed.

Various solutions have been offered. Probably the most frequently discussed is the concept of translators. In this image of reality, a black box device takes one language and translates it into the other language.
This may involve a variety of methods, including:

   * text (L1) to text (L2)
   * speech (L1) to text (L2)
   * text (L1) to speech (L2)
   * speech (L1) to speech (L2)

There are also intermediary plans:

   * text (L1) to text (L2) to speech (L2)
   * speech (L1) to text (L1) to text (L2)
   * speech (L1) to text (L1) to text (L2) to speech (L2)

These plans all have one thing in common: the idea that bi-directional translation can occur between the various languages. Unfortunately, this is not often the case. Many concepts in one language do not translate easily to another language. In addition, the syntax of one language may result in poorly formulated sentences when translated across the language barrier.

The greatest success in this area results when a single human being is used as an intermediary between the two language participants. This translator possesses knowledge of both languages and uses natural linguistic skills to provide approximations of the meaning in the two languages.

Language translation also requires sophisticated knowledge of the two languages. The translator is often fluent in one language (their native language) and semi fluent in the second language. This means that the translation in one direction will be of higher quality than in the reverse direction. For example, I have more vocabulary available when I translate to English than when I translate from English to one of my secondary languages.

Is there an alternative solution? In this proposal, it is suggested that an intermediary language is needed to support real success in international linguistic communications.

The present plan requires our translators to go from English to German; English to Chinese; English to Japanese; English to Vietnamese; then German to English; Chinese to English; etc. If we take the number of languages in the world and then assume each will talk to each other language, the formula for the number of required translators is:

   N * (N-1)

where N is the number of languages.
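As a quick arithmetic check of this formula, the sketch below computes the number of translators for a direct language-to-language design and, for comparison, for a design in which every language translates only to and from one intermediary language (two translators per language). The class and method names are illustrative assumptions and are not part of this proposal.

```java
// Sketch only: TranslatorCount and its methods are hypothetical names.
final class TranslatorCount {
    // Direct design: one translator per ordered pair of distinct languages.
    static long pairwise(long n) {
        return n * (n - 1);
    }

    // Intermediary design (assumption): each language needs one translator
    // to the intermediary language and one translator from it.
    static long viaIntermediary(long n) {
        return 2 * n;
    }

    public static void main(String[] args) {
        for (long n : new long[] {10, 100}) {
            System.out.println(n + " languages: " + pairwise(n)
                    + " direct, " + viaIntermediary(n) + " via intermediary");
        }
    }
}
```

The direct count grows with the square of N, while the intermediary count grows linearly, which is the motivation for the design that follows.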
With ten languages, this comes to 90 translators. With 100 languages, this comes to 9900 translators. How can this be avoided?

The solution is to provide a universal language (UL) which acts as an intermediary between each language. Each language must provide a translation to and from the UL. This means that for each language there are two translators, one from UL and one to UL. With 100 languages the requirement is 200 translators rather than 9900.

The use of a UL has a number of immediate advantages. First, the UL can be constructed with a limited number of concepts. Rather than providing all the nuances of every language, the UL is restricted to a limited subset of concepts. These can grow as new needs emerge.

The UL also minimizes the requirements for knowing who is going to receive a message. An item can be posted in UL on the World Wide Web. The browser is handed the job of translation from UL to the user's Host Language (HL). The UL to HL translator can be provided [2] as a plug-in, Java applet or built-in capability of the browser.

Email has similar advantages. A person wishing to send important information between multiple international facilities can do so without concern about the recipients' languages. Rather than one translation for each language that will receive the message, a single translation to UL will suffice.

1.5 TERMINOLOGY

1. BE - Base Element(s) - This is the Unicode [3] along with definitions that make up the UL. The Base Elements can include Position Dependent Syntax (PDS) in UL. The Standard Base Elements (SBE) are the UL used by all users. SBE does not include Meaning Components for User Defined Components. Only the UDC place holders are included.

2. FC - Fixed Component(s) - A Unicode component in the UL that has a fixed meaning across languages.

3. G-UL - Generic Universal Language - The base of the UL. The G-UL is distributed to all users along with translations to and from all known HL.

4. 
HL - Host Language(s) - The native language which the UL will translate to and from. Host Languages are normally our native languages (English, French, Chinese, Japanese). However, there is nothing that restricts a Host Language. It would be possible to add a made up Host Language, such as Pig Latin.

5. ILB - International Language Bridge - The name used to designate the collection of all concepts, requirements and other factors relating to the creation, dissemination and use of a Universal Language.

6. Jumbo - A collection of multiple UL Unicode that is used frequently enough by a group that it justifies a UDC as a short cut.

7. LE - Linking Element(s) - Used to force a connection of a TD to a FC. This is needed only where there may be confusion about which FC is being modified by the TD. LE should rarely be needed, because the UL syntax will normally identify the required logic.

8. LF - Language Forcing - A method to inline a specific HL. A LF will force the translator to use a specific language whether or not it is the user's HL.

9. MC - Meaning Component(s) - This is the global meaning that must be conveyed by a UL Group.

10. MV - Multiple Vocabularies - A method of providing a Core Vocabulary (CV) along with Secondary Vocabularies (SV) to support quicker downloading and end user specialization. The CV is distributed to all users. The SV are provided only as needed. MV may overlap, i.e. the same Unicode may appear in more than one MV. However, where there is overlap, the Meaning Component (MC) must be the same wherever the Unicode is the same. This differs from UDC.

11. PR - Pictorial Representation(s) - A method of creating visual clues to the meaning of a UL Unicode.

12. SBE - Standard Base Element(s) - UL without UDC Meaning Components.

13. SC - Syntax Component(s) - A UL-G that is used to convey syntactic meaning, e.g. "this is a question".

14. 
TD - Type Designator(s) - A Unicode component in the UL that is always followed by a variable field. TD are used for specific information with a range, e.g. tall or short; amount of money; distance.

15. TE - Type Element(s) - Associated with a TD. The TE is the variable quantity relative to the specific TD. Multiple TE can be associated with a single TD.

16. UDC - User Defined Component(s) - The UDC allows a group of users to customize the UL. This allows the core UL to be restricted. A Generic UL provides the base construct. The UDC adds the specialization. UDC are provided for each of the types available in UL.

17. UL - Universal Language - The general language used as an intermediary between all Host Languages.

18. UL-G - Universal Language Group(s) - A group of one or more UL Unicode that builds a Meaning Component. Many UL-Gs require only a single UL Unicode. A UL-G that is frequently used and which is created from multiple Unicode may become a Jumbo Group. When this happens, it is a prime candidate for grouping into a UDC.

19. UL-Phrase - A UL Phrase is a group of UL-G that combine to create enough meaning to support all HL translators. Depending on the context, a UL-Phrase may be as little as a single Unicode or made up of a string of UL-Gs.

20. Unicode - Universal Code - A double byte code that already exists in languages such as Java. Unicode exists to provide a universal method of representing any language, native or other, on the World Wide Web and in programs. Unicode is a recognized standard.

21. VM_function - Variable Meaning Function(s) - A VM function is used in conjunction with a Type Designator. It provides a HL specific way to create the variable meanings from the range of Unicode.

1.6 UL Components

Unicode (Universal Code)

The Unicode [4] standard allows 16 bit binary code to represent a language. A language range in the Unicode standard is used to signify which language is being represented. There are numerous expansion ranges in the Unicode standard.
UL can easily be fitted into this expansion. For the purpose of this document Unicode references are shown as xx##, e.g. xx01 or xx07. The xx represents the language range identifier, which is not presently specified for UL. The ## represents the specific number used to identify a specific member in the UL Unicode [5].

1.7 Fixed Components (FC)

A fixed component is one which has the same Meaning Component (MC) in all contexts. A meaning component is not tied to a language. For example, "food" is a meaning component. It translates differently into different languages, but it has the same meaning across the languages.

While it may seem logical to equate Meaning Components with nouns in the English language, this is not necessarily the case. For example "send an email" might be a useful MC. Other examples are:

   * call me by phone
   * available for a conference call

An FC should be broad. Specificity must be defined through UDC. So "computer" might make sense as an FC; "Dell Computer" would not.

Certain parts of language are not needed in UL. For example words like "the", "a" or "an" are not needed. These must be added by the HL translator.

There are some FCs that are needed but should only be available in a single form. For example negation uses multiple words in many languages. The rules for UL must be clear and universal with respect to negation.

Questions are another area where languages may vary significantly. A question is created with a Syntax Component (SC).

1.8 Type Designator (TD) - Type Element (TE)

A type designator is used to signal that the next item is of a particular type. Typical type designators might include:

   * Money/Quantity
   * Size/Unit
   * Emotional Component
   * Proper Noun
   * Address
   * Phone Number
   * Temporal/Past Present Future

A Type Designator is followed by a piece of information called the Type Element. The TE may vary across an extremely large range. For example people's names may vary enormously.
The UL to HL translator must provide a method of translating the Variable Information (VI) that follows the TD. For some instances, this is extremely simple. For example a unit conversion from HL dollars to UL monetary units could exist for each language and monetary unit. The same is true of phone numbers, which could translate from English numbers to UL numbers. A Chinese client would receive the UL numbers and translate them into Chinese numbers.

Each type designator will also include a NULL value option. This will be used to indicate that the value is either unavailable in the HL that created the content, or unspecified in this context. The UL to HL translator will provide a reasonable translation for the NULL value.

It is worth noting that UL provides a special TE to support verb tense or temporal relationship. When the temporal TE is used, there may be two or more TE used in series. This allows verb conjugation to be handled with minimal UL Unicode. A verb (TD) might have two (or more) TE associated with it. This allows "run, ran, will run, has run, etc." to all be done with a TD-TE. In addition a second TE (TD-TE-TE) will change this to "walk, walked, has walked, will walk, etc.". The generic TD is the state of motion of a person. The conjugation TE changes the temporal aspects, while the relative TE changes the state from run to walk.

With this strategy, numerous English language words are produced with only three Unicode. This should also be true for other languages. The critical part of UL becomes the identification of the fundamental Meaning Components (MC).

1.9 Variable Meaning Function (VM_function)

A Variable Meaning Function is a language based rule for adding the variable meaning associated with the TE part of the TD-TE pair. Typical examples of VM_function(s) are:

   * Numbers - a method of producing a Host Language (HL) specific interpretation of the numeric values from the TE.
   * Direction - a method of producing a HL specific interpretation of direction, e.g. north, south, up, down, etc.
   * Emotion - a method of producing a HL specific interpretation of emotional state, e.g. good, bad, terrific, etc.
   * Temperature - a method of producing a HL specific interpretation of temperature, e.g. hot, cold, freezing, etc.

2.0 Language Forcing (LF)

In some instances language overriding might be needed. For example, a monetary web site might want all transactions represented in dollars even when the surrounding information is translated.

2.1 Linking Element (LE)

A linking element is used to connect a type designator (TD) to a fixed component (FC). LE are used when confusion can result as to which FC is being modified. It is unclear at this time if Linking Elements are really needed. A properly designed language syntax may remove the need for LE.

2.2 User Defined Components (UDC)

User defined components are placed in a local vocabulary located on the client. For example, Gates Rubber might use the word "rubber" frequently during emails, while Coca Cola might use "beverage". The UDC are defined prior to communications. In the case of a Web page, it would be the job of the Web page provider to produce a list of UDC used on the Web page. These could be downloadable either prior to access or during access. The client would announce its language to the server so the correct subset of UDC would be provided.

A UDC may be a full concept rather than a single word. Returning to Coca Cola as a potential user, the concept "sell Coca Cola" could be a single UDC. With several fixed components it would be possible to get:

   * Do we have a good ad to sell Coca Cola?
   * Can you sell Coca Cola in China?
   * What do we need to do to increase our ability to sell Coca Cola?

2.3 Basic Fixed Components

The basic fixed components must be selected to provide the most common content used in email and Web pages. We begin by providing general categories that should be universal across languages.
   * Objects - Stationary things in our world. These are in general what the English language calls nouns, but not all nouns are objects. For example in the sentence "The love that I feel.", the word "love" is a noun, but it is not an object.

An LE rule would be that all type designators must precede the basic element they modify. On a computer screen this would allow "large ball" to map out a space prior to placing the ball on the screen. The same is true for "right ball" (ball to the right). The difficulty arises with "red ball". In this instance the ball needs to be placed on the screen before the red is added. It is unclear whether linking elements should be used here, or store and wait logic in the language translators.

2.6 Syntax Component

In UL, a Syntax Component (SC) is a Unicode used to convey syntactical meaning in a statement. The most clearly defined SC is the question component. An SC->Question will make a UL statement into a question. The location of SCs will be a defined part of the UL language.

UL will not contain some of the normal syntax found in languages. For example, there will not be commas. A period SC may be needed to identify completion of a UL sentence. This is an open issue.

Since UL can be embedded in other languages, e.g. HTML, it will allow for syntax transitions from outside the UL language. For example, an HTML table of UL information could be sent via email.

2.7 UL Syntax

The language of UL will have strict syntax rules. The basic unit is Unicode. A group of Unicode will produce a phrase. Each phrase creates a phrase meaning, i.e. it should provide enough information to create an acceptable statement in any HL. The set of all phrases creates a message. It is also possible to have multiple messages provided the messages are embedded in another language, e.g. multiple messages in UL embedded in HTML.

2.8 Position Dependent Syntax (PDS)

Position Dependent Syntax (PDS) allows UL to use the same Unicode to produce different Meaning Components.
This occurs when the Unicode is a TD-TE combination. For example the same Unicode can be used for degree of heat and cold as are used for amount of money. In the former case there is a temperature TD while in the latter there is a money TD. They are both followed by a Unicode with a possible range of values.

TD-TE elements will be created to provide a reasonable range of values. Where more resolution is needed, the User Defined Components (UDC) will be used. Linking Elements can also provide support for these specialized applications.

2.9 User Defined Syntax (UDS)

The availability of User Defined Components (UDC) allows a specialized need to define a syntax independent of UL. These applications will be allowed but will not be supported by the UL development team.

3.0 Degree of Specificity

Languages are subtle by nature. They allow us to express things in a rich variety of ways. As stated earlier, the UL should be simple, forcing the translators to add richness to the translations. As a simple example, "run, walk and stand" are three different words in the English language. But in UL they could be expressed as three states of a person in motion. The first is Motion + large magnitude. The second is Motion + minimum magnitude. The third is Motion + zero magnitude. If we add to this a Unicode for road, a number of sentences can be created.

   * walking on the road
   * running on the road
   * standing on the road
   * on the road
   * moving down the road

Of course, if moving on roads was a common part of a group's communications, it would be possible to create a User Defined Component (UDC) to represent movement on a road.

This brings out an important point. A UDC can be created by transferring a language specific definition to the end users, or it can be created by combining identifiable UL components. In the latter case, the UDC is referred to as a Jumbo UDC.
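The two ways of creating a UDC described above can be sketched as a small lookup. This is an illustration under stated assumptions, not a proposed implementation: the class name, the string form of the codes, and the meanings given to xx01, xx03 and xx17 are all invented for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: every name and every dictionary entry here is hypothetical.
final class UdcSketch {
    // A UDC defined by transferring host language text to the end users.
    static final Map<String, Map<String, String>> DIRECT = new LinkedHashMap<>();
    // A Jumbo UDC defined as a sequence of existing UL codes.
    static final Map<String, List<String>> JUMBO = new LinkedHashMap<>();
    // Stand-in for the ordinary UL to HL dictionary (meanings are made up).
    static final Map<String, Map<String, String>> UL_DICTIONARY = new LinkedHashMap<>();

    static {
        DIRECT.put("UDC#1", Map.of(
                "HL#1", "text in host language #1 here",
                "HL#2", "text in host language #2 here"));
        JUMBO.put("UDC#2", List.of("xx01", "xx03", "xx17"));
        UL_DICTIONARY.put("xx01", Map.of("HL#1", "move"));
        UL_DICTIONARY.put("xx03", Map.of("HL#1", "on"));
        UL_DICTIONARY.put("xx17", Map.of("HL#1", "road"));
    }

    // Returns the host language content for a UDC, before any phrase-level
    // clean-up such as adding articles.
    static String resolve(String udc, String hostLanguage) {
        Map<String, String> direct = DIRECT.get(udc);
        if (direct != null) {
            String text = direct.get(hostLanguage);
            if (text == null) {
                throw new IllegalArgumentException("No " + hostLanguage + " text for " + udc);
            }
            return text;
        }
        List<String> codes = JUMBO.get(udc);
        if (codes == null) {
            throw new IllegalArgumentException("Unknown UDC: " + udc);
        }
        // A Jumbo UDC is expanded into its UL codes, each translated in turn.
        List<String> words = new ArrayList<>();
        for (String code : codes) {
            Map<String, String> entry = UL_DICTIONARY.get(code);
            String word = (entry == null) ? null : entry.get(hostLanguage);
            if (word == null) {
                throw new IllegalArgumentException("No " + hostLanguage + " entry for " + code);
            }
            words.add(word);
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        System.out.println(resolve("UDC#1", "HL#2"));
        System.out.println(resolve("UDC#2", "HL#1"));
    }
}
```

One consequence of the second form is that a Jumbo UDC needs no per-language text of its own: any translator that already handles the underlying UL codes can expand it.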
   Case 1: UDC#1 => HL#1 -> "text in host language #1 here"
         : UDC#1 => HL#2 -> "text in host language #2 here"
   Case 2: UDC#2 => UL -> xx01 xx03 xx17 ... (sequence of UL codes produces a jumbo)

3.1 Multiple Vocabularies

UL should support Multiple Vocabularies (MV). Rather than having every UL to HL translator support every possible UL Unicode, it should be possible to define sub-vocabularies that are application specific. For example, there could be a set of highly common codes. Then there could be a second set of business codes and a third set of sports codes. A person wishing to send an email about business would load the common codes and the business codes. For sports discussions, load the common codes and sports codes. For a business that provides sports equipment, load all three. When possible, which codes to load can be sent along with the message.

MV will consist of a Core Vocabulary (CV) and many Secondary Vocabularies (SV). There is nothing to stop different MV from containing the same UL-G. However, where there is overlap between MV, the Meaning Component (MC) must be the same for overlapping elements. Only UDC (User Defined Components) can have a different Meaning Component for the same Unicode. UDC are not a part of the Multiple Vocabularies. They are an independent area of the UL that is fixed in size and available at all times to the users and translators.

3.2 Basic Translators (BT) and Filter Translators (FT)

A Basic Translator (BT) takes content to or from a HL (Host Language) and translates to or from UL. A Basic Translator has no knowledge of embedding in other languages.

A Filter Translator (FT) will include knowledge of other languages in addition to UL. Some possible FT are:

   * EMAIL FT - Leaves in place email headers needed for transfer. Replaces other content either UL-HL or HL-UL.
   * HTML FT - Leaves in place HTML identifiers. Replaces other content either UL-HL or HL-UL.
   * VIDEO TEXT FT - Text output available along with TV is now common practice.
     It is used for deaf people, elderly people or to support noisy environments (e.g. aerobics classes). Using UL, it would be possible to support language independence. The Video Text would be available in multiple languages.
   * MOVIE TEXT FT - Movies require translators to provide subtitles in different languages. A Movie Text FT could allow a single HL, e.g. English, to be translated through UL to other languages. This would allow wider movie distribution around the world.

With a properly created translator (TR) a person should be able to take an email in the HL, put it through the HL to UL email filter, then send it to a foreign recipient. The far end recipient would then use a UL-HL Filter Translator (FT) to go from UL to their native host language.

3.3 Some Initial Benefits of UL

   * With UL a multi-language Web site can be provided with a single UL site.
   * UL provides a more user friendly interface for international email.
   * With UL it is possible to take an existing Web site in any HL and run it through an HL (language #1) to UL filter. It can then be passed through a second filter, UL to HL (language #2). The Web page will then appear in the user's native language.
   * With UL it is possible to take an existing email in any HL and run it through an HL (language #1) to UL filter. It can then be passed through a second filter, UL to HL (language #2). The email will then appear in the user's native language.

3.4 Steps in Creating UL

1. Identification of high level objects - This is being worked in the present document.

2. Identification of HL language expectations - It may not be possible to do all languages for a prototype. It should be possible to do a subset. This subset should include languages from each of the major continents. For example, a good subset might be English, Chinese, Japanese, Russian, Spanish, Arabic.
It is important that the initial work not focus on European languages only, because they have a common syntactic structure, largely use the Latin alphabet and have numerous other commonalities.

3. Architecture of the UL language. This will require a team of linguistic and computer experts. The goal will be to conceptualize the needed structure to provide a common international base. The architectural analysis will also try to identify the best locus for initial prototyping. For example, should it be browser based, server based, or integrated into HTTP? Should it be developed as a Java API or as a browser plug-in? Will it initially be a runtime component or less interactive? These and other questions can be worked by the software segment of the architecture team. The linguistic team will focus on aspects specific to the language. How will negation operate? Where will a question operator reside? What constitutes a Meaning Component across the various languages? How will basic and filter translators operate? What is the core UL? What types of vocabularies should be supported?

4. Identification of the Base Elements (BE). The basic core language will need to be identified. Next there will be a need to prototype it and determine how well it translates between different languages. While the original work can be done with a limited expectation, at some point there will be a need to determine how easily it expands. Finally the specialization stage using UDC needs to be tested.

3.5 Short and Long Term Goals For UL

Here are some of the possible short and long term uses for a Universal Language:

1. Email - The use of email represents the most frequent method of communications on the Internet. The number of emails sent daily has multiplied at an astronomical rate. As global markets continue to grow, a method of rapid translation of email content will greatly expand our global commitments.

2. Web Pages - HTML content has now become an enormous asset for providing information to the public.
The ability to provide multilingual content through a single translation mechanism using UL will greatly expand the dissemination of content. This can have important advantages where responsibilities are world wide. It can mean that international programs in space exploration, telecommunications, medicine, etc. can share content without language constraints.

3. Firewall Content - Many multinational companies now provide information behind the safety of firewalls. Since they often use standard mechanisms available on the World Wide Web, the use of UL should be invaluable to their private international networking.

4. Other Document Formats - The basic mechanisms provided by UL should be expandable to other document formats. These can be public formats, e.g. PDF and Word, or more private formats such as FrameMaker. The UL translation mechanisms would work equally well with attachable documents as with standard interchange mechanisms.

5. Computer Languages - UL can provide significant gains for computer languages attempting to build language independence. UL is built on Unicode. This code has been developed to provide multiple language capability. Java is one example of a language that has done extensive work on Unicode usage.

6. Intermediary For Text to Speech and Speech to Text - While UL is a Unicode textual language, it can be used as an intermediary for the transition from one language to another language in any format. This includes speech, Braille or any other media used to communicate language content. As the state of the art continues to grow, UL could potentially serve as the mechanism for real time audio translation. This means it could be used in numerous disciplines, including the replacement of movie subtitles, translation of public speeches, etc. While this is clearly a long term goal, the objective is not unreasonable. Setting the stage for future growth in language translation will clearly serve many potential growth areas.
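The email and Web page uses listed above depend on the Filter Translators of section 3.2, which leave transport or markup content in place and pass only the remaining content to a Basic Translator. As a rough illustration only, the sketch below applies a stand-in translator to the body of a message and leaves the header block untouched. The class name is hypothetical, the "translator" is a toy string replacement, and the message is assumed to use bare newlines with a blank line between headers and body, which is a simplification of real email.

```java
import java.util.function.UnaryOperator;

// Sketch only: EmailFilterSketch is a hypothetical name, not part of the draft.
final class EmailFilterSketch {
    // Leaves everything up to and including the first blank line in place
    // and applies the supplied Basic Translator to the rest.
    static String filter(String message, UnaryOperator<String> basicTranslator) {
        int split = message.indexOf("\n\n");
        if (split < 0) {
            return message; // headers only, nothing to translate
        }
        String headers = message.substring(0, split + 2);
        String body = message.substring(split + 2);
        return headers + basicTranslator.apply(body);
    }

    public static void main(String[] args) {
        String mail = "To: a@example.com\nSubject: Hi\n\nHELLO";
        // Toy HL to UL step: the greeting is replaced by the placeholder code xx00.
        System.out.println(filter(mail, body -> body.replace("HELLO", "xx00")));
    }
}
```

Keeping the Basic Translator as a parameter mirrors the split described in section 3.2: the same filter can be run in either direction (HL to UL or UL to HL) by passing a different translator.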
3.6 Fonts & Sounds

The Universal Language has no external representation. It exists only as Unicode. Unlike other languages that require character representation, e.g. Kanji, UL is an intermediate language.

UL has no sound representation. Again this differs from other languages, where pronunciation and confusion due to accents are a critical concern.

The fact that UL exists only as an intermediate language has significant advantages. Translation rules need only apply to meaning. By providing strict syntactic rules for UL, translations will carry minimal risk of misinterpretation. The primary difficulty will be in determining the linkage from UL to the Native Languages.

3.7 UL Type Dictionaries (UTD)

Each of the Unicode types in UL is grouped together into a UL Type Dictionary. The UTD required to support UL are:

   * UTD.FC - The Fixed Component dictionary consisting of all Fixed Component objects in UL.
   * UTD.TD - The Type Designator dictionary consisting of all Type Designator objects in UL.
   * UTD.LE - The Linking Element dictionary consisting of all Linking Element objects in UL.
   * UTD.UDC - The User Defined Component dictionary consisting of all User Defined Component objects in UL.
   * UTD.UL-G - The Universal Language Group dictionary consisting of all Universal Language Group objects in UL.
   * UTD.LF - The Language Forcing dictionary consisting of all Language Forcing objects in UL.
   * UTD.SC - The Syntax Component dictionary consisting of all Syntax Component objects in UL.
   * UTD.Jumbo - The Jumbo dictionary consisting of all Jumbo objects in UL.

3.8 Host Language Interface

The Unicode interpretation of Host Languages will be used to provide a common language base. For example the English equivalent of "cow" is Unicode XX XX XX. The same meaning in Chinese is "xiahu" which is Unicode XX XX. Note that the English cow takes three Unicode while the Chinese requires only two, and UL requires only one.
All three require that the Unicode fall in the range for the specified language, i.e. HL or UL.

3.9 Multi-lingual Vocabulary Issues

A serious issue for UL is the amount of vocabulary required and the nature of Host Language vocabulary. The number of words in the English language is enormous. Other languages share these large sizes. While much of the vocabulary is common across the languages, e.g. many nouns, there are places where vocabulary will result in high degrees of complexity. Some examples of problem areas are:

   * Words which have multiple meanings in one language but only a single meaning in a second language. For example, "HOT" is an English word for a temperature state and, in food, for a spicy state. In Spanish, these are two separate words.
   * Words which may have other nuances in one language that do not exist in another language. In Chinese, there are different words for "first son" versus "second or greater son". The "first son" has special responsibilities that do not exist in other cultures. Other relatives are also viewed differently than among European cultures.
   * Words which are taken from other languages. The word "email" is used by other cultures. However, it is frequently pronounced using the characteristics of the host language.

A number of tools are available in UL to solve these issues. These include:

   * UDC - the User Defined Components - Allowing areas where languages significantly diverge to be covered on a case by case basis.
   * LF - the use of language forcing - Allowing a HL to be forced into use, e.g. Sputnik.
   * TD-TE - allowing relative aspects of a language to be included, e.g. the various Chinese relationships could include TE values. A more complex Host Language output would be needed in English to fully explain what is meant.

4.0 Unicode Parsing

The primary types of parsing are UL to HL and HL to UL. These need to be examined separately.

4.0.1 UL to HL Parsing

To parse from UL to HL requires a two stage process.
Each UL segment is separated into UL Phrases. A single UL Phrase is taken as a complete unit. This may consist of from 1 to N Unicode.

The UL Phrase is then Content Parsed (CP) into Meaning Components. This is done by reading in one Unicode at a time and doing the required translation to the Host Language. In some instances more than one Unicode must be processed to provide the full meaning. At this point, the translated message will consist of a Content Message (CM) in the Host Language.

The Content Message may still require additional words or re-arranging in order to make it a well worded and syntactically correct phrase in the Host Language. This is accomplished in Phrase Parsing (PP). During this stage the articles or other subtleties of the language are added. The Host Language Phrase can now be output to the end user.

The following example from Chinese illustrates the need for Phrase Parsing. In English the meaning content "one book" and "one person" would be easily realized with two Fixed Components (book and person) and a TD-TE pair for numbers. In Chinese a connector word (counter) is required for any object (noun). It is used between the number and the noun. The counter character is different for book as opposed to person. Choosing the correct connector requires both the knowledge that a counter is being used and the type of object being counted. This would need to be

      Unicode_L1[] L1_code_segment;
      int L1_length;
      Unicode_L2[] L2_code_segment;
      int L2_length;
      // ... etc.
      Unicode UL_code_segment;
   }

   class Fixed_Component extends Meaning_Component {
      /*
       * A Fixed Component is a static base element in the Universal
       * Language. Fixed Components have the same meaning in every UL
       * message.
       */

      // Must first verify that the language is not already set.
      boolean set_by_language(byte language);
   }

   class UTD_Fixed_Component {
      void add(Fixed_Component component);

      // Verify that the fixed component dictionary is complete
      // for release.
      boolean sanity();

      // Reference to the fixed components list.
      protected static Fixed_Component[] fixed_components;
   }

   interface Variable_Meaning {
      /* Variable Meaning Functions (return types not yet specified) */
      void VM_L1();
      void VM_L2();
   }

   /*
    * Type Designators must be base classes in order to provide the
    * Variable Meaning functions. This is the reason for the abstract
    * function definitions.
    */
   abstract class Type_Designator extends Meaning_Component
         implements Variable_Meaning {
      /*
       * A Type Designator is a static base element in the Universal
       * Language. Type Designators always have a variable quantity
       * Unicode after them.
       */

      /* Variable Meaning Functions */
      public abstract void VM_L1();
      public abstract void VM_L2();
      // ...

      Unicode variable_value;
   }

   class Type_Element extends Meaning_Component {
   }

   interface Host_Language {
      byte English    = 0;
      byte French     = 1;
      byte Spanish    = 2;
      byte German     = 3;
      byte Chinese    = 4;
      byte Vietnamese = 5;
      byte Japanese   = 6;
      byte Hindi      = 7;
      // Values still to be assigned: Russian, Cherokee, Arabic, Greek,
      // Hebrew, Bengali, Canadian_aboriginal, Korean, Ethiopic,
      // Gujarati, Gurmukhi, Cyrillic, Mongolian, Romanian, Serbian,
      // Georgian, Thai, Tibetan, Ogonek
      // ...
   }

UML Analysis

   PARSER HL-UL
   PARSER UL-HL
   PHRASE PARSER HL-pre to HL-post

   TBD

A Simple Hello Example

   UL - xx00
   Meaning Content - Greeting

   HL_English 0048 0045 004C 004C 004F
   ASCII Equivalent: HELLO

   HL_Spanish
   Tonal/ASCII [6] Equivalent: Hola

   HL_Chinese (Unicode Chinese) 你好吗 [7]
   Tonal Equivalent [8]: Ni Hao Mah

1 Translation normally is symmetrical. Any translation from L1 to L2 will also require a reverse translation from L2 to L1.

2 Throughout this document, suggestions are made for implementation.
These are meant only as suggestions. At some point an architectural decision will be needed.

3 The Unicode Standard corresponds to ISO/IEC 10646. Information on Unicode is available at http://www.unicode.org

4 The definitions of UL types in this document suggest that a single Unicode will be used to represent a single Meaning Component. Investigation will be required to determine whether the quantity of Meaning Components will require two or more Unicode to adequately represent language requirements.

5 While this document suggests that UL should be added to the Unicode standard, it is not an absolute necessity. UL is an intermediate language. It drives no character set and has no visual or textual component. For this reason, it could exist as a purely abstract language, independent of existing standards. This also means that work on UL can begin without waiting for standards decisions.

6 Tonal/ASCII equivalent is used to show that Spanish does have a few special characters, e.g. the letter eñe (ñ).

7 These characters will appear in Simplified Chinese with NJStar Communicator.

8 Tonal equivalent is used for Chinese to show that the Kanji character set is not used in this document.
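The Simple Hello Example can also be expressed as a small lookup. The sketch below is only suggestive: the class and method names are hypothetical, and the value 0xE000 merely stands in for the UL greeting code, which this document leaves unassigned as "xx00". The helper toCodeUnits() shows how the four-digit hexadecimal listing given for HL_English is derived from the Host Language text.

```java
import java.util.Map;

class HelloSketch {
    // Hypothetical stand-in for the UL greeting code ("xx00" in the example).
    static final int UL_GREETING = 0xE000;

    // Host Language renderings of the Meaning Content "Greeting",
    // taken from the Simple Hello Example (tonal/ASCII forms).
    static final Map<String, String> GREETING = Map.of(
        "HL_English", "HELLO",
        "HL_Spanish", "Hola",
        "HL_Chinese", "Ni Hao Mah");

    // Translate the single UL greeting code into the requested Host Language.
    static String toHostLanguage(int ulCode, String hostLanguage) {
        if (ulCode != UL_GREETING) {
            throw new IllegalArgumentException("unknown UL code");
        }
        String text = GREETING.get(hostLanguage);
        if (text == null) {
            throw new IllegalArgumentException("unknown Host Language " + hostLanguage);
        }
        return text;
    }

    // Render Host Language text as space-separated four-digit hex code units.
    static String toCodeUnits(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) text.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String english = toHostLanguage(UL_GREETING, "HL_English");
        System.out.println(english + " = " + toCodeUnits(english));
        System.out.println(toHostLanguage(UL_GREETING, "HL_Spanish"));
        System.out.println(toHostLanguage(UL_GREETING, "HL_Chinese"));
    }
}
```

The point of the sketch is the size difference noted in the Host Language Interface section: a single UL code carries the meaning, while each Host Language needs several Unicode to express it.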