Website Parse Templates Working Group                       Av. Manukyan
Internet-Draft                                              Ar. Manukyan
Intended status: Informational                               Ar. Mailyan
Expires: October 17, 2008                                   Al. Sayadyan
                                                       WebsiteParser.com
                                                          April 15, 2008


                        Website Parse Templates
               draft-manukyan-website-parse-templates-00

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on October 17, 2008.

Copyright Notice

   Copyright (C) The IETF Trust (2008).

Abstract

   This document defines general concepts and terminology for creating
   Website Parse Templates, which are used to provide web crawlers and
   data parsers with proper information about website structure and
   content.

Manukyan, et al.        Expires October 17, 2008                [Page 1]

Internet-Draft          Website Parse Templates               April 2008

Table of Contents

   1.  Introduction
   2.  Website Parse Template Structure
     2.1.  Ontology
     2.2.  Template
     2.3.  URLs
   3.  Security Considerations
   4.  IANA Considerations
   5.  References
     5.1.  Normative References
     5.2.  Informative References
   Authors' Addresses
   Full Copyright Statement
   Disclaimer of Validity
   Intellectual Property

1.  Introduction

   A Website Parse Template is a specification that describes a
   website's structure and content for web crawlers.  It is intended to
   give web crawlers proper web page templates so that they can parse
   website content more accurately and co-ordinate the attributes of
   the same object when they are used on different pages of the same
   website.

   A web page structure and content description expressed as a Website
   Parse Template is stored in a Website Parse Template file (with the
   ".icdl" extension).

   Website Parse Templates are compatible with existing Semantic Web
   [3] concepts defined by the World Wide Web Consortium [4] (Resource
   Description Framework - RDF [1] and Web Ontology Language - OWL [5])
   and with the Universal Networking Language - UNL [6] specifications.

2.  Website Parse Template Structure

   A Website Parse Template consists of the following sections:

      Ontology - enumeration and definition of all concepts used in a
      given website;

      Templates - associations between the website's structural
      elements and the defined concepts;

      URLs - URLs or URL patterns that comply with the described
      templates.

   A single Website Parse Template refers to a single host, while a
   single host may have several Website Parse Templates describing its
   structured content.
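   As a rough illustration of the structure described above, the
   following Python sketch models a Website Parse Template in memory:
   one host plus ontology, template, and URL sections.  It does not
   reproduce the ".icdl" file syntax.  All class, field, and method
   names (Concept, Template, UrlRule, WebsiteParseTemplate,
   template_for), the host www.example.com, and the sample values are
   hypothetical, and the sketch assumes that URL patterns are regular
   expressions matched against the URL path.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from urllib.parse import urlsplit


@dataclass
class Concept:
    """One ontology entry (hypothetical model, not .icdl syntax)."""
    name: str
    attributes: List[str] = field(default_factory=list)


@dataclass
class Template:
    """Associates structural elements (here, XPath strings) with concepts."""
    name: str
    bindings: Dict[str, str] = field(default_factory=dict)


@dataclass
class UrlRule:
    """A URL pattern and the name of the template it complies with."""
    pattern: str
    template: str


@dataclass
class WebsiteParseTemplate:
    """The three sections, bound to exactly one host."""
    host: str
    ontology: List[Concept] = field(default_factory=list)
    templates: List[Template] = field(default_factory=list)
    urls: List[UrlRule] = field(default_factory=list)

    def template_for(self, url: str) -> Optional[Template]:
        """Return the template of the first matching URL rule, else None."""
        parts = urlsplit(url)
        if parts.hostname != self.host:
            return None  # this template describes a different host
        by_name = {t.name: t for t in self.templates}
        for rule in self.urls:
            # Assumption: the pattern must match the whole URL path.
            if re.fullmatch(rule.pattern, parts.path):
                return by_name.get(rule.template)
        return None


# Illustrative values only.
wpt = WebsiteParseTemplate(
    host="www.example.com",
    ontology=[Concept("artist", ["name", "image", "bio"])],
    templates=[Template("artist_page", {"//h1": "artist.name"})],
    urls=[UrlRule(r"/artists/[^/]+", "artist_page")],
)
match = wpt.template_for("http://www.example.com/artists/some-band")
```

   Keeping the host on the top-level object mirrors the rule that one
   template covers one host; a host with several templates would
   simply be described by several such objects.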
   The host should be specified at the beginning of the corresponding
   Website Parse Template file:

      [example markup not preserved in this copy]

2.1.  Ontology

   The ontology section defines all concepts used in the website.
   Defined concepts must be enclosed within tags.  The ontology name
   and the language used to describe the concepts must be specified.

   For example, the concept "artist", which inherits from "person" and
   may have the attributes "name", "image", "bio", "track", "video",
   etc., has the following representation:

      [example markup not preserved in this copy]

   The ontology name can be any comprehensible string, but the language
   must be one of the supported language types, e.g. "icdl:ontology",
   "owl", "unl:uws".

   The "inherit" tag expresses an inheritance relation between two
   concepts; the "has" tag expresses an attribute relation.

   Each defined concept has a default attribute, the object identifier
   (id), which web crawlers use to co-ordinate the same object's
   attributes used on different pages of the same website.

   Website Parse Template provides the following predefined concepts,
   which are common to all kinds of websites:

      "Menu" - navigation bar/menu

      "Logo" - design element/logo

      "Content" - element that contains the main textual content of the
      page

      "Advertisement" - advertisement/banner

      "External Link" - element that contains external links

2.2.  Template

   The template section describes the website's HTML structure and
   content using the concepts defined in the ontology section.  The
   template should be enclosed within tags.

   For example, an artist web page with a certain structure may be
   described by the following simple website template:

      [example markup not preserved in this copy]

   The template name can be any comprehensible string, but the language
   must be one of the supported language types, e.g. "icdl:template",
   "rdf", "unl:expression".

   HTML structure elements represented in the template section are
   indicated by "xpaths" [2] or "tag IDs".

   A web page may contain structured repeatable content ("repeatable
   block") included in one main structural element ("container").

   If a specified structural element is already described by another
   template, the "reference" tag can be used to point to that "template
   block" as follows:

      [example markup not preserved in this copy]

   This makes it possible to create hierarchical relations between
   templates of the same website, so that web crawlers can use the
   specified reference(s) to identify the same object on different
   pages of a given website.

2.3.  URLs

   This section defines the URLs or URL patterns that correspond to the
   described website parse templates.  Listed URLs/URL patterns should
   be enclosed within tags as follows:

      [example markup not preserved in this copy]

   "template" shows which of the described website templates
   corresponds to the listed URLs/URL patterns.

   Regular expressions [7] are used to describe URL patterns.  The
   concepts needed for a URL pattern definition (such as "name" and
   "id") must be defined beforehand in the ontology section.

3.  Security Considerations

   The syntax rules for creating Website Parse Templates that are
   documented here pose no direct risk to computers and networks.
   However, people can use these rules to build Website Parse Templates
   that are inaccurate or even deliberately misleading, which results
   in incorrect parsing of the website's structured content.  Systems
   that are based on Website Parse Templates need to consider issues
   related to their accuracy and validity as part of their design and
   implementation, and users of such systems need to consider the
   design and implementation assumptions.

4.  IANA Considerations

   This document has no actions for IANA.

5.  References

5.1.  Normative References

   [1]  A. Swartz, "application/rdf+xml Media Type Registration",
        RFC 3870, September 2004.

   [2]  J. Boyer, M. Hughes and J. Reagle, "XML-Signature XPath Filter
        2.0", RFC 3653, December 2003.

5.2.  Informative References

   [3]  I. Herman, "W3C Semantic Web Activity",
        http://www.w3.org/2001/sw/, April 2008.

   [4]  I. Jacobs, "About the World Wide Web Consortium (W3C)",
        http://www.w3.org/Consortium/, February 2008.

   [5]  G. Schreiber, M. Dean, F. van Harmelen, J. Hendler,
        I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider and
        L. A. Stein, "OWL Web Ontology Language Reference/W3C
        Recommendation", http://www.w3.org/TR/owl-ref/, February 2004.

   [6]  "Introduction of the UNL",
        http://www.unl.ru/introduction.html.

   [7]  J. Goyvaerts, "Regular Expression Tutorial",
        http://www.regular-expressions.info/index.html, November 2007.

Authors' Addresses

   Avet Manukyan
   WebsiteParser.com

   Email: contact@websiteparser.com
   URI:   http://www.websiteparser.com

   Armen Manukyan
   WebsiteParser.com

   Email: contact@websiteparser.com

   Arthur Mayilyan
   WebsiteParser.com

   Email: contact@websiteparser.com

   Alexander Sayadyan
   WebsiteParser.com

   Email: contact@websiteparser.com

Full Copyright Statement

   Copyright (C) The IETF Trust (2007).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document may not be modified, and derivative works of it may
   not be created, except to publish it as an RFC and to translate it
   into languages other than English.

Disclaimer of Validity

   This document and the information contained herein are provided on
   an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE
   REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE
   IETF TRUST AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL
   WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY
   WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE
   ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS
   FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed
   to pertain to the implementation or use of the technology described
   in this document or the extent to which any license under such
   rights might or might not be available; nor does it represent that
   it has made any independent effort to identify any such rights.
   Information on the procedures with respect to rights in RFC
   documents can be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use
   of such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository
   at http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.