INTERNET-DRAFT                                              J. Salsman
Filename:                                                Cisco Systems
submitted to the W3C Voice Browser activity            4 January 1999

Form-based Device Input and Upload in HTML

Status of this Memo

This draft extends an experimental protocol for the Internet community. This draft does not specify an Internet standard of any kind. Discussion and suggestions for improvement are requested. Distribution of this memo is unlimited.

This document is an Internet-Draft. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

To view the entire list of current Internet-Drafts, please check the "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow Directories on ftp.is.co.za (Africa), ftp.nordu.net (Europe), munnari.oz.au (Pacific Rim), ds.internic.net (US East Coast), or ftp.isi.edu (US West Coast).

1. Abstract

Currently, HTML forms allow the producer of the form to request information -- including files of data -- from the operator reading the form. However, this capability is limited because HTML forms don't provide a way to ask the operator to submit input from arbitrary sources such as audio devices like microphones.

Since input and upload from various devices is a feature that will benefit many applications, this draft proposes an extension to the HTML INPUT TYPE=FILE form element specified in RFC 1867 to allow information providers to express requests for uploads from audio and other devices uniformly. A discussion of MIME audio data types to facilitate useful audio upload responses follows.
Security considerations, audio usability and quality concerns, and a backward compatibility strategy that lets new user agents handle HTML written with earlier proposals for audio input in mind are also discussed. Motivations, including language instruction assistance, voice transcription, and other applications, conclude the draft.

2. HTML forms with device input file upload submission

Section 3.1 of RFC 1867 provides for the presentation of an arbitrary "widget" to specify input for file uploads. When an INPUT tag of type FILE is encountered with a DEVICE attribute, the associated value (such as MICROPHONE, or MIC) might select the use of a widget capable of buffering and editing real-time input (such as speech) instead of entering a file selection mode. If an ACCEPT attribute is present in a device file input element, the browser might constrain the MIME type of uploaded data to match one of the types in the specified list.

If the value of the DEVICE parameter is FILESYSTEM or FILES then the INPUT element might be treated as usual according to RFC 1867, except that the subset of files presented to the operator to choose from may be constrained by the specified list of MIME types instead of a pattern of file names or extensions.

Device input has no original filename of the kind that section 3.3 of RFC 1867 specifies for the parameters of the 'content-disposition: form-data' and 'content-disposition: file' HTTP headers. Those headers might therefore be provided with a 'type' parameter representing the MIME type of the encoded data, if known, and a 'device' parameter with the same value as the DEVICE attribute of the associated form input element. If the device or the specified MIME type(s) are unsupported, the value of the 'device' header parameter might instead be 'unsupported'; if the device is unavailable, the value might be 'unavailable'.
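As an illustration only, the header parameters described above might be assembled as in the following Python sketch. The function name, the 'status' argument, and the field name "speech" are hypothetical and are not defined by this draft or by RFC 1867. Parameter values are written as quoted strings because a MIME type contains a "/" character.

```python
def device_disposition(name, device, mime_type=None, status="ok"):
    """Build a content-disposition header line for one device upload part.

    `status` is a hypothetical flag used only in this sketch:
    "ok", "unsupported", or "unavailable".
    """
    if status not in ("ok", "unsupported", "unavailable"):
        raise ValueError("unknown status: " + status)

    # On failure the keyword replaces the device name in the 'device' parameter.
    device_value = device if status == "ok" else status

    params = ['name="%s"' % name]
    if mime_type is not None:
        # The 'type' parameter is only sent when the MIME type is known.
        params.append('type="%s"' % mime_type)
    params.append('device="%s"' % device_value)
    return "content-disposition: form-data; " + "; ".join(params)


print(device_disposition("speech", "MICROPHONE", "audio/L16"))
print(device_disposition("speech", "MICROPHONE", status="unavailable"))
```

The second call shows the case where no microphone can be opened: no 'type' parameter is sent and the 'device' parameter carries the keyword instead of the device name.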
If the MIME types requested are unsupported, an additional parameter 'alternates' might be included with a space-separated list of MIME types of the same content-type which may be supported as alternatives for the specified device. The content-disposition header parameter syntax is described in RFC 1806.

There may be significant limitations on the client browser's ability to buffer input for upload. Browsers might provide an estimate of the default MAXLENGTH available for device input and upload through the HTTP header 'Pragma: DEVICE-MAXLENGTH='(bytes), which represents the content-length available to the browser for buffering (see section 14.32 of RFC 2068).

Furthermore, the VALUE attribute, when present, may be used to disambiguate between multiple similar devices. Under most conditions the operator should be allowed to select the device from ambiguous sources of input, or to re-select it if it was specified with a VALUE parameter.

If real-time events, such as those described and proposed by Gregory S. Aist in "A General Architecture for a Real-Time Discourse Agent and a Case Study in Computerized Oral Reading Tutoring" (Carnegie Mellon University Computational Linguistics Program, 6 December 1996), are required, then the Real-time Transport Protocol (RTP, currently RFC 1889) should be used instead.

Because of security concerns discussed in section 4 below, HTML scripts might not be able to invoke a form submission when the form involves any kind of file upload without explicit instructions from the session operator to the contrary.

2.1. Examples
Say something:
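The following Python sketch renders INPUT elements that would fit the four cases described in this section. It is an illustration under stated assumptions, not the draft's own example markup. The helper name, the NAME values, the 'channels' parameter, the comma separator inside ACCEPT, the CAMERA device name, and the use of VALUE="2" to pick the second camera are all hypothetical.

```python
from html import escape


def device_input(name, device, accept=None, value=None):
    """Render an INPUT TYPE=FILE element carrying the proposed DEVICE attribute."""
    attrs = [("TYPE", "FILE"), ("NAME", name), ("DEVICE", device)]
    if accept is not None:
        attrs.append(("ACCEPT", accept))
    if value is not None:
        attrs.append(("VALUE", value))
    # Escape attribute values so quotes or ampersands cannot break the tag.
    rendered = " ".join('%s="%s"' % (key, escape(val)) for key, val in attrs)
    return "<INPUT %s>" % rendered


examples = [
    # 1. Plain microphone input.
    device_input("speech", "MICROPHONE"),
    # 2. Microphone input limited to two audio encodings.
    device_input("speech", "MICROPHONE",
                 accept="audio/L16; rate=11025; channels=1, audio/x-cepstral-voc"),
    # 3. Ordinary file upload limited to text files by MIME type.
    device_input("notes", "FILES", accept="text/*"),
    # 4. The second of several cameras, selected through VALUE.
    device_input("snapshot", "CAMERA", value="2"),
]

for tag in examples:
    print(tag)
```

Each element would sit inside a FORM that uses ENCTYPE="multipart/form-data" and METHOD=POST, as RFC 1867 requires for file upload.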
In this simple form, the HTML author has requested the upload of sampled microphone input from the operator upon form submission.

Here MIC is not used as an abbreviation. The author of the HTML has requested that the data input from the microphone be encoded as either the MIME type Audio/L16 -- sixteen-bit signed linear audio samples (most-significant byte first) -- as specified in RFC 1890 section 4.4.8, with a single (monaural) channel and a sample rate of 11,025 samples per second, or an unspecified extended MIME Audio type named 'x-cepstral-voc'.

Here the form element may be used to upload a file as usual, except that the files to select from might be constrained to text files, without explicit regard to their filenames or extensions.

The final example shows how these extensions may be used to request input from other kinds of devices, such as the second of two or more cameras connected to the system running the browser.

3. User interface usability and quality concerns for audio

An audio sample is customarily recorded on computer equipment with a dialog routine capable of allowing the user to record, pause, play back, erase, or otherwise edit the recording. Browsers might provide the operator with the same kind of dialog routine for audio device input. If a MAXLENGTH has been specified or is in force because of limited buffer size, the buffer size used and remaining might be displayed as a dynamic bar graph (or as a percentage if graphics are unavailable). A display of time in seconds used and remaining in the buffer may also be provided.

Most MIME types defined for audio do not provide high-quality audio encodings. The 'audio/basic' and other types which use a sample rate of 8,000 samples per second truncate the audio spectrum at 4,000 Hz according to the Nyquist theorem, discarding information important for discerning consonants.
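The Nyquist limit cited above and the buffer display described earlier in this section both reduce to simple arithmetic. The Python sketch below assumes uncompressed linear samples; the function names and the 88,200-byte buffer are hypothetical.

```python
def nyquist_limit_hz(sample_rate):
    """Highest frequency representable at a given sample rate."""
    return sample_rate / 2


def buffer_seconds(num_bytes, sample_rate, bits_per_sample=16, channels=1):
    """Recording time held by num_bytes of uncompressed linear audio."""
    bytes_per_second = sample_rate * (bits_per_sample // 8) * channels
    return num_bytes / bytes_per_second


def buffer_status(used_bytes, max_bytes, sample_rate,
                  bits_per_sample=16, channels=1):
    """Values a recording widget might show while the buffer fills."""
    remaining = max_bytes - used_bytes
    return {
        "percent_used": 100.0 * used_bytes / max_bytes,
        "seconds_used": buffer_seconds(used_bytes, sample_rate,
                                       bits_per_sample, channels),
        "seconds_remaining": buffer_seconds(remaining, sample_rate,
                                            bits_per_sample, channels),
    }


# 8,000 samples per second cuts the spectrum off at 4,000 Hz.
print(nyquist_limit_hz(8000))

# One second of 16-bit monaural audio at 11,025 samples per second
# in a hypothetical 88,200-byte buffer.
print(buffer_status(22050, 88200, 11025))
```

At 11,025 samples per second the same calculation gives a limit of 5,512.5 Hz, and at 16,000 samples per second it gives 8,000 Hz.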
The audio/basic type and other MIME Audio types also use a sample size of eight bits, which does not usually provide enough dynamic range for accurate automatic speech recognition unless published automatic gain control algorithms are reliably used. If sixteen-bit signed audio encodings are used according to section 4.4.8 of RFC 1890, the sample rate -- specified as the 'rate' parameter of the MIME type 'audio/l16' -- might be at least 11,025 or 16,000 to provide sufficient information for automatic speech recognition. Otherwise, the audio feature extraction encoding of the speech recognition algorithm might be used to provide a more compact representation to shorten the upload.

4. Security considerations

Browser operators may not want to send their files, recordings, pictures, video, or other device inputs to arbitrary sites without their explicit permission and direction. Therefore, browser authors are encouraged to disallow the submission of forms which include any kind of file upload by any means other than the standard HTML operator-controlled buttons for form submission without explicit instruction from the session operator to the contrary.

Accordingly, the SIZE parameter, document style sheets, and document layers may be prevented from obscuring any kind of file upload widget, especially those capable of accepting a default filename.

Furthermore, just as the operator may take direct action to initiate, terminate, review and edit recording as described in the previous section, browser authors are encouraged to prevent HTML scripts from taking those and similar actions, unless for example the operator has specifically enabled such script actions with a security option. Even then, such preferences might be specified by the operator to reset after an interval or at the end of the session.

Finally, explicit information might be provided to ensure that the operator is informed when files are being uploaded.

5. Compatibility with earlier forms of audio input

Audio device input has been proposed before and implemented from a microphone at least as early as 1994 in experimental versions of common Web browsers. To accommodate the syntax of these earlier extensions, a browser might interpret a valid XML statement such as as the device input form with all other attribute/value pairs of the original INPUT element kept the same as specified. This would retain compatibility for all implementations of which the author of this draft is aware.

6. HTML Document Type Description changes

Along with the extension to the HTML InputType entity described in the previous section, this proposal makes an addition to the HTML DTD for the INPUT element ATTLIST of an #IMPLIED attribute DEVICE of type CDATA.

7. Motivations and conclusion

The primary motivation for these extensions is to add the capability of speech input to Web-based educational systems. For example, the "Test of English as a Foreign Language," or TOEFL assessment, consists of multiple-choice questions based on media composed of text and audio recordings, so it would be possible to represent the TOEFL with current HTML multimedia content and forms. However, the TOEFL makes no provision whatsoever for assessing the accuracy of pronunciation by the subjects of the assessment, except that provided by the ability to accurately identify the terms in the text of the assessment. So while it scores the important ability to listen, the TOEFL does not assess the important ability to speak with correct pronunciation.

But with form-based audio input and upload, and speech recognition servers capable of aligning and scoring the pronunciation of words and phonemes, such a Web-based TOEFL could be extended to lessen the number of indiscernible graduate teaching assistants, for example. These possibilities for language instruction are not limited to the graduate level or to the English language.
Other motivations include the development of "dictation servers" capable of transforming spoken audio uploaded through an HTTP session to the corresponding text suitable for sending in email or including in another document, for example. Natural language continuous speech recognition software conforming to standard APIs for automatic dictation is as of this writing available from retail outlets for less than US$90, so there is ample reason to believe that dictation servers might soon become commonplace on the Web with these extensions. [Please see "Additional references" 1-3, below.]

Furthermore, this could be a great help for hearing-impaired people who want to use a "phonology server" (similar to the servers described above) to practice improving their pronunciation without depending on a human speech coach.

Finally, Larry Masinter, author of RFC 1867, has indicated that graphical paper scanners might be used for applications such as OCR and bar-code upload. "DEVICE=SCANNER" is suggested.

The change to the HTML DTD is very simple, but very powerful. It enables a much greater variety of services to be implemented via the World-Wide Web than is currently possible due to the lack of a peripheral input upload submission facility. This would be a very valuable addition to the capabilities of the World-Wide Web.

8. Author's address and acknowledgments

James Salsman
Cisco Systems, San Jose, California
Bovik Research Inst., a non-profit organization
1285 Montecito Ave Apt 57
Mountain View, CA 94043
Email: jps@bovik.org, jsalsman@cisco.com
Phone: (650) 967-2737

Larry Masinter and Harald Alvestrand contributed excellent advice. Ed Tecot contributed the means of device and media independence. David McMillian contributed to the description of capabilities of the audio widget. Syracuse Language Systems, The Learning Co., and EduSoft, Ltd., contributed much of the inspiration; Jack Mostow and Greg Aist filled in the real-time details for younger grades.

"TOEFL" and "Test Of English as a Foreign Language" are registered trademarks of Educational Testing Service.

References

[RFC 1867] Form-based File Upload in HTML. E. Nebel & L. Masinter, November 1995.

[RFC 1806] Communicating Presentation Information in Internet Messages: The Content-Disposition Header. R. Troost & S. Dorner, June 1995.

[RFC 2068] Hypertext Transfer Protocol -- HTTP/1.1. R. Fielding, J. Gettys, J. Mogul, H. Frystyk, & T. Berners-Lee, January 1997.

[RFC 1889] RTP: A Transport Protocol for Real-Time Applications. H. Schulzrinne, S. Casner, R. Frederick, & V. Jacobson, January 1996.

[RFC 1890] RTP Profile for Audio and Video Conferences with Minimal Control. H. Schulzrinne, January 1996.

Additional references:

[1] http://www.cybertranscriber.com/ -- automatic transcription from spoken dictation from Speech Machines Corp.

[2] http://www.ordinate.com/ -- testing of English fluency, listening, and vocabulary from Ordinate Corp.

[3] http://www.cs.cmu.edu/~listen/ -- literacy instruction from a reading tutor that listens from Carnegie Mellon's Project LISTEN

END OF INTERNET-DRAFT
Filename: