Website Parse Templates Working Group Av. Manukyan
Internet-Draft Ar. Manukyan
Intended status: Informational Ar. Mailyan
Expires: October 17, 2008 Al. Sayadyan
WebsiteParser.com
April 15, 2008
Website Parse Templates
draft-manukyan-website-parse-templates-00
Status of this Memo
By submitting this Internet-Draft, each author represents that any
applicable patent or other IPR claims of which he or she is aware
have been or will be disclosed, and any of which he or she becomes
aware will be disclosed, in accordance with Section 6 of BCP 79.
Internet-Drafts are working documents of the Internet Engineering
Task Force (IETF), its areas, and its working groups. Note that
other groups may also distribute working documents as Internet-
Drafts.
Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference
material or to cite them other than as "work in progress."
The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt.
The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.
This Internet-Draft will expire on October 17, 2008.
Copyright Notice
Copyright (C) The IETF Trust (2008).
Abstract
This document defines general concepts and terminology for the
creation of Website Parse Templates, which are used to provide web
crawlers/data parsers with proper information about website structure
and content.
Manukyan, et al. Expires October 17, 2008 [Page 1]
Internet-Draft Website Parse Templates April 2008
Table of Contents
1. Introduction
2. Website Parse Template Structure
2.1. Ontology
2.2. Template
2.3. URLs
3. Security Considerations
4. IANA Considerations
5. References
5.1. Normative References
5.2. Informative References
Authors' Addresses
Full Copyright Statement
Disclaimer of Validity
Intellectual Property
1. Introduction
A Website Parse Template is a specification that describes website
structure and content for web crawlers. It provides web crawlers with
web page templates so that they can parse website content more
accurately and co-ordinate the attributes of the same object used in
different pages of the same website. A description of web page
structure and content expressed as a Website Parse Template is
referred to as a Website Parse Template file (with the ".icdl"
extension).
Website Parse Template is compatible with the existing Semantic Web[3]
concepts defined by the World Wide Web Consortium[4] (Resource
Description Framework - RDF[1] and Web Ontology Language - OWL[5])
and with the Universal Networking Language - UNL[6] specifications.
2. Website Parse Template Structure
A Website Parse Template consists of the following sections:
Ontology - enumeration and definition of all concepts used in a given
website;
Templates - associations between the website's structural elements
and the defined concepts;
URLs - URLs or URL patterns that comply with the described
templates.
A single Website Parse Template refers to a single host, while a
single host may have several Website Parse Templates describing its
structured content. The host should be specified at the beginning of
the corresponding Website Parse Template file:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
. . . . . . . . . . . . . . . .
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
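As an illustration only, the structure described above can be
pictured as a small in-memory data model. This is a minimal Python
sketch, not part of the Website Parse Template syntax: the class
name, the field names and the example hosts are assumptions made for
this example.

```python
from dataclasses import dataclass, field


@dataclass
class WebsiteParseTemplate:
    """Hypothetical in-memory view of one Website Parse Template file."""
    host: str                                      # the single host the file refers to
    ontology: dict = field(default_factory=dict)   # concept name -> definition
    templates: dict = field(default_factory=dict)  # template name -> element/concept associations
    urls: dict = field(default_factory=dict)       # template name -> list of URL patterns


def templates_for_host(files, host):
    """Return every Website Parse Template that refers to the given host."""
    return [f for f in files if f.host == host]


# One host may have several Website Parse Templates (hosts are made up).
files = [
    WebsiteParseTemplate(host="www.example.com"),
    WebsiteParseTemplate(host="www.example.com"),
    WebsiteParseTemplate(host="music.example.org"),
]
matching = templates_for_host(files, "www.example.com")
```

Keeping the host as a field of each file object mirrors the rule that
one file refers to one host, while a lookup by host may return
several files.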
2.1. Ontology
The ontology section defines all concepts used in the website. The
defined concepts must be enclosed within tags. It is required to
specify the ontology name and to indicate the language that is used
to describe the concepts.
E.g. the concept "artist", which is inherited from "person" and may
have attributes "name", "image", "bio", "track", "video", etc., has
the following representation:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The ontology name can be any comprehensible string, but the language
must be one of the supported language types, e.g. "icdl:ontology",
"owl", "unl:uws". The "inherit" tag shows an inheritance relation
between two concepts; the "has" tag shows an attribute relation.
Each defined concept has a default attribute, the object identifier
(id), which web crawlers use to co-ordinate the same object's
attributes used in different pages of the same website.
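The inheritance and attribute relations can be sketched as follows.
This is a hedged Python illustration, not the ICDL syntax itself: the
Concept class, the all_attributes helper and the assumption that
"person" has the attribute "name" are hypothetical. Only the
"artist" attributes and the default "id" attribute come from the text
above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Concept:
    name: str
    inherit: Optional[str] = None            # parent concept name, if any
    has: list = field(default_factory=list)  # attribute names


def all_attributes(ontology, name):
    """Collect the default "id" plus inherited and own attributes, parents first."""
    chain = []
    seen = set()
    current = name
    while current is not None:
        if current in seen:
            raise ValueError("cyclic inheritance at " + current)
        seen.add(current)
        concept = ontology[current]
        chain.append(concept)
        current = concept.inherit
    attributes = ["id"]  # every concept has the object identifier by default
    for concept in reversed(chain):
        for attribute in concept.has:
            if attribute not in attributes:
                attributes.append(attribute)
    return attributes


ontology = {
    "person": Concept("person", has=["name"]),  # assumed for illustration
    "artist": Concept("artist", inherit="person",
                      has=["name", "image", "bio", "track", "video"]),
}
artist_attributes = all_attributes(ontology, "artist")
```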
Website Parse Template provides the following predefined concepts,
which are common to all kinds of websites:
"Menu" - navigation bar/menu
"Logo" - design element/logo
"Content" - element that contains main textual content of the page
"Advertisement" - advertisement/banner
"External Link" - element that contains external links
2.2. Template
The template section describes the website's HTML structure and
content using the concepts defined in the ontology section. The
template should be enclosed within tags.
E.g. an artist web page with a certain structure may be described by
the following simple website template:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
The template name can be any comprehensible string, but the language
must be one of the supported language types, e.g. "icdl:template",
"rdf", "unl:expression".
HTML structure elements represented in the template section are
indicated by "xpaths"[2] or "tag IDs". A web page may contain
structured repeatable content ("repeatable block") included in one
main structural element ("container").
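How a crawler might apply such XPath-based associations, including a
container with repeatable blocks, is sketched below. This is an
assumption-laden Python illustration: the page markup, the TEMPLATE
mapping and its "container"/"repeatable" keys are invented for this
example, and the standard library's ElementTree supports only a
subset of XPath and requires well-formed markup, so real HTML would
need an HTML-aware parser.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed page used only for this sketch.
PAGE = """
<html>
  <body>
    <h1 id="artist-name">Example Artist</h1>
    <ul id="tracks">
      <li>First Track</li>
      <li>Second Track</li>
    </ul>
  </body>
</html>
"""

# Hypothetical template: concept attribute -> XPath of the structural element.
TEMPLATE = {
    "name": ".//h1[@id='artist-name']",
    "track": {"container": ".//ul[@id='tracks']", "repeatable": "li"},
}


def apply_template(page, template):
    """Extract one value per simple rule and a list per container rule."""
    root = ET.fromstring(page)
    result = {}
    for attribute, rule in template.items():
        if isinstance(rule, dict):
            # Container rule: find the container, then each repeatable block.
            container = root.find(rule["container"])
            blocks = container.findall(rule["repeatable"]) if container is not None else []
            result[attribute] = [(block.text or "").strip() for block in blocks]
        else:
            element = root.find(rule)
            result[attribute] = element.text.strip() if element is not None else None
    return result


parsed = apply_template(PAGE, TEMPLATE)
```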
If the specified structural element is already described by another
template, the "reference" tag can be used to point to that "template
block" as follows:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
This makes it possible to create hierarchical relations between
templates of the same website so that web crawlers can use the
specified reference(s) to identify the same object in different pages
of a given website.
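One way a crawler could combine what it learns about the same object
from different pages is to key the parsed records on the concept and
the object identifier. The following Python sketch assumes a
hypothetical record layout ("concept", "id", "attributes") and a
"first value wins" merge policy; neither is defined by this document.

```python
from collections import defaultdict


def merge_objects(parsed_pages):
    """Combine attribute values for records that share (concept, id)."""
    merged = defaultdict(dict)
    for record in parsed_pages:
        key = (record["concept"], record["id"])
        for attribute, value in record["attributes"].items():
            merged[key].setdefault(attribute, value)  # keep the first value seen
    return dict(merged)


# Records as they might come from three different pages (made-up data).
parsed_pages = [
    {"concept": "artist", "id": "42", "attributes": {"name": "Example Artist"}},
    {"concept": "artist", "id": "42", "attributes": {"bio": "Short biography."}},
    {"concept": "artist", "id": "7", "attributes": {"name": "Another Artist"}},
]
objects = merge_objects(parsed_pages)
```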
2.3. URLs
This section defines the URLs or URL patterns that correspond to the
described website parse templates. The listed URLs/URL patterns
should be enclosed within tags as follows:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
"template" shows which one of the described website templates
corresponds to the listed URLs/URL patterns. Regular Expressions[7]
are used to describe URL patterns. The concepts necessary for a URL
pattern definition (such as "name" and "id") must be defined
beforehand in the ontology section.
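Matching a URL against such patterns and recovering the "name" and
"id" concepts can be sketched as follows. The template names, the URL
layout and the use of Python named groups are assumptions made for
this illustration; the pattern notation actually used in a Website
Parse Template file may differ.

```python
import re

# Hypothetical mapping: template name -> URL patterns (regular expressions).
URL_PATTERNS = {
    "artist_page": [r"^/artist/(?P<name>[^/]+)/(?P<id>\d+)/?$"],
    "track_page": [r"^/track/(?P<id>\d+)/?$"],
}


def match_url(path, url_patterns):
    """Return (template name, captured concepts) for the first matching pattern."""
    for template, patterns in url_patterns.items():
        for pattern in patterns:
            found = re.match(pattern, path)
            if found:
                return template, found.groupdict()
    return None, {}


template, concepts = match_url("/artist/example-artist/42", URL_PATTERNS)
```

A URL that matches no pattern yields no template, so a crawler can
fall back to generic parsing for pages the template does not cover.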
3. Security Considerations
The syntax rules for Website Parse Templates creation that are
documented here pose no direct risk to computers and networks.
However, people can use these rules to build Website Parse Templates
that are inaccurate or even deliberately misleading, which results in
incorrect parsing of structured website content.
Systems that are based on Website Parse Templates need to consider
issues related to their accuracy and validity as part of their design
and implementation, and users of such systems need to consider the
design and implementation assumptions.
4. IANA Considerations
This document has no actions for IANA.
5. References
5.1. Normative References
[1] A. Swartz, "application/rdf+xml Media Type Registration",
RFC 3870, September 2004
[2] J. Boyer, M. Hughes and J. Reagle, "XML-Signature XPath Filter
2.0", RFC 3653, December 2003
5.2. Informative References
[3] Ivan Herman, "W3C Semantic Web Activity",
http://www.w3.org/2001/sw/, April 2008
[4] I. Jacobs, "About the World Wide Web Consortium (W3C)",
http://www.w3.org/Consortium/, February 2008
[5] Guus Schreiber, Mike Dean, Frank van Harmelen, Jim Hendler, Ian
Horrocks, Deborah L. McGuinness, Peter F. Patel-Schneider and
Lynn Andrea Stein, "OWL Web Ontology Language Reference/W3C
Recommendation", http://www.w3.org/TR/owl-ref/, February 2004
[6] "Introduction of the UNL", http://www.unl.ru/introduction.html
[7] J. Goyvaerts, "Regular Expression Tutorial",
http://www.regular-expressions.info/index.html, November 2007
Authors' Addresses
Avet Manukyan
WebsiteParser.com
Email: contact@websiteparser.com
URI: http://www.websiteparser.com
Armen Manukyan
WebsiteParser.com
Email: contact@websiteparser.com
Arthur Mayilyan
WebsiteParser.com
Email: contact@websiteparser.com
Alexander Sayadyan
WebsiteParser.com
Email: contact@websiteparser.com
Full Copyright Statement
Copyright (C) The IETF Trust (2008).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document may not be modified, and derivative works of it may not
be created, except to publish it as an RFC and to translate it into
languages other than English.
Disclaimer of Validity
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY, THE IETF TRUST AND
THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF
THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Intellectual Property
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the procedures with respect to rights in RFC documents can be
found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
http://www.ietf.org/ipr.
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at
ietf-ipr@ietf.org.