RDF hacking

SWIPE 0.1 specification

Abstract

SWIPE is a simple RDF vocabulary that provides some basic facilities to support the extraction of structured RDF data from arbitrary HTML, XHTML and pseudo-HTML textual content. SWIPE can be used to support simple screenscraping and meta-search applications, or extended (like RSS) to more richly describe interfaces to Web data services.

Authors

Version

Latest Version: http://rdfweb.org/2001/01/swipe/

0.1 (unpublished; unstable; uncirculated...)

Status

This is a working discussion note and should not be considered a stable target for implementation.

Comments should be directed to the RDFWeb-dev mailing list (rdfweb-dev@egroups.com), copying RSS-Dev (rss-dev@egroups.com) for RSS-related issues.

Rights

Copyright © 2000,2001 by the Authors.

(this paragraph copied without permission from the RSS 1.0 Specification :-)

Permission to use, copy, modify and distribute the SWIPE Specification and its accompanying documentation for any purpose and without fee is hereby granted in perpetuity, provided that the above copyright notice and this paragraph appear in all copies. The copyright holders make no representation about the suitability of the specification for any purpose. It is provided "as is" without expressed or implied warranty.

This copyright applies to the SWIPE Specification and accompanying documentation and does not extend to the data format itself.

Overview

SWIPE is a simple RDF vocabulary that provides some basic facilities to support the extraction of structured RDF data from arbitrary HTML, XHTML and pseudo-HTML textual content. The SWIPE vocabulary is in particular intended for use with ill-formed markup (typically online search results) that may not be parseable using more formal SGML and XML based tools. SWIPE descriptions are associated with one or more online searchable services that are typically accessed using the HTTP protocol, and that typically return HTML or pseudo-HTML in response to a search request consisting of a number of attribute/value pairs. The combination of a SWIPE service description and some textual data returned from a query to that service provides the basis for data extraction tools to generate an RDF data graph representing (some aspects of) the returned data.

Goals

SWIPE is designed to provide a relatively simple, practical tool that can be used to encapsulate ad-hoc, human-oriented Web services behind a more machine-oriented interface. As such it might be used alongside specifications such as XML-RPC and SOAP that offer API or message-based Web data interfaces, or with tools in the WWW::Search, WIDL, and Sherlock tradition that are more concerned with "screen scraping" data from arbitrary HTML-formatted search results. XSLT, the XML transformation language, is another related technology. Where appropriate (e.g. search result pages that are in XHTML format), a SWIPE service description can reference an XSLT sheet instead of employing the more basic regular-expression extraction language described below.

SWIPE and RDF Site Summary (RSS)

SWIPE descriptions are intended for general use by RDF tools, and in particular for use as an extension module with the RDF Site Summary (RSS) vocabulary. The base RSS 1.0 specification allows for the description of a Web content feed as a channel consisting of a list of items (such as news, updated pages, announcements etc). RSS 1.0 also allows for a very simple characterisation of a search facility associated with a Web site or channel. SWIPE can be used to augment that description with additional meta-information to allow RSS 1.0 tools to better process search results from the search services mentioned in RSS site descriptions. This might be used, for example, to aggregate search results from a distributed search of several RSS-described data sources, or to provide a common user interface for managing and navigating search results.

Non-Goals

This vocabulary is not intended to replace the richer facilities offered by fully-featured search protocols such as Z39.50, DASL (WebDAV), LDAP etc. It is also not intended to serve as a general purpose machine interface (API, query language etc.) to XML or RDF networked data sources. Future extensions to SWIPE may provide for better interoperability with more sophisticated (and heavyweight) specifications.

SWIPE Vocabulary

The SWIPE vocabulary is divided into "Basic" and "Util" sections, reflecting a pragmatic, tool-oriented approach. Additional utility constructs may be added in future revisions to this specification, or provided by third-party extensions. The SWIPE-Basic core is a very simple set of properties and types that should allow simple, practical tools to be easily constructed using generic RDF APIs.

SWIPE-Basic

The following properties and types are defined.

Properties

swiper
The swiper relation connects our SWIPE information to some identified Web service. Rather than use the search resource (CGI-script, servlet etc) as the identifier for the service, we use the 'home page', e.g. http://oreillynet.com/meerkat/. Consequently, we can use the swiper properties of a Web service to get to a bundle of RDF properties that describe how to interact with that service. The range of the swiper property is SwipeSpec.
in
The in relation is used on a SwipeSpec, and points to an RDF container listing one or more SWIPE DataItems.
out
The out relation is used on a SwipeSpec, and points to SWIPE ParseRules. The interpretation of the "parse rules" info depends on the format(s) used; we indicate this using a parseformat property on the ParseRules.
action
The action property, like the HTML forms attribute of the same name, indicates a Web service that can respond to parameters passed via HTTP GET or POST methods.
method
The method property, like the HTML forms attribute of the same name, indicates the (@@TODO: or 'a'?) HTTP method(s) through which a Web service offers an interface. (@@TODO: extensions? SOAP/XP/XML-RPC?)
macfile
The macfile property (which perhaps belongs in the utility namespace) tells us where (if anywhere) we can find an Apple Macintosh Sherlock plugin for this service.
resultListStart
A text string representing the point in a document from which content to be extracted becomes available. The sub-portion of the document identified can be mapped into the RSS notion of a channel.
resultListEnd
A text string representing the last point in a document from which useful content might be extracted.
resultItemStart
A text string used for the repeated extraction of RSS items from a larger textual document.
resultItemEnd
A text string used for the repeated extraction of RSS items from a larger textual document.
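
The four result* properties above suggest a simple two-stage text match: first narrow the document to the result list, then repeatedly cut out items. The following is a minimal sketch of that idea in Python, not a normative algorithm; the function name extract_items, the fallback behaviour when a list marker is missing, and the sample text are all assumptions made for illustration.

```python
def extract_items(text, list_start, list_end, item_start, item_end):
    """Return the text found between item markers inside the result list."""
    # Stage 1: narrow to the result list. If a list marker is absent we fall
    # back to the start/end of the document (an assumption, not in the spec).
    begin = text.find(list_start)
    begin = 0 if begin == -1 else begin + len(list_start)
    end = text.find(list_end, begin)
    end = len(text) if end == -1 else end
    region = text[begin:end]

    # Stage 2: repeatedly extract whatever sits between item markers.
    items = []
    pos = 0
    while True:
        start = region.find(item_start, pos)
        if start == -1:
            break
        start += len(item_start)
        stop = region.find(item_end, start)
        if stop == -1:
            break
        items.append(region[start:stop].strip())
        pos = stop + len(item_end)
    return items


# Hypothetical response text, using the marker strings from the Meerkat example.
sample = ("<html><meerkat><story>First</story> junk "
          "<story>Second</story></meerkat><story>ignored</story></html>")
items = extract_items(sample, "<meerkat>", "</meerkat>", "<story>", "</story>")
print(items)
```

Plain substring search is used here because the properties are described as text strings; a tool could equally treat them as patterns if a parseformat called for it.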

TODO: more properties are needed for richer extraction. Write Schema.

Classes

SWIPE-Basic defines the following RDF classes.

SwipeSpec
A SwipeSpec provides a collection of properties describing the expected inputs and likely outputs for some Web-accessible data service.
BasicSpec
A BasicSpec is a kind of SwipeSpec; the information associated with a BasicSpec can be used to extract RSS-like records from Web data services, adopting a "screen scraping" approach.
ParseRules
DataItem
DataItem is a super-class of Input and UserInput.
Input
An Input is a pairing of a name and (optionally) some content (plain text) that correspond to the attribute-value pairs typically used in HTML FORM / HTTP CGI interactions.
UserInput
An Input whose value is typically supplied by an end-user.

Notes:

BasicSpec is a sub-class of SwipeSpec. A BasicSpec will often be described using properties from other namespaces such as the RSS Syndication (@@TODO: refs, status) and Taxonomy modules.
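
To illustrate how the action property and the Input and UserInput classes might combine into an HTTP GET request, here is a small sketch. It assumes that Inputs carry fixed content, that a UserInput is filled from a value supplied at query time, and that parameters are sent in the order listed; the helper name build_get_url and the query text are hypothetical.

```python
from urllib.parse import urlencode


def build_get_url(action, inputs, user_values):
    """Build a GET URL from an action and an ordered list of inputs.

    inputs is a list of (name, content) pairs; content of None marks a
    UserInput whose value is looked up in user_values.
    """
    pairs = []
    for name, content in inputs:
        if content is None:
            content = user_values[name]
        pairs.append((name, content))
    separator = "&" if "?" in action else "?"
    return action + separator + urlencode(pairs)


# Inputs as they might be read from a SwipeSpec; the search text is made up.
inputs = [("t", "7DAY"), ("_fl", "sherlock"), ("s", None)]
url = build_get_url("http://oreillynet.com/meerkat/sherlock",
                    inputs, {"s": "rdf metadata"})
print(url)
```

A POST variant would send the same encoded pairs in the request body rather than in the URL.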

SWIPE-Util

[to be specified]

SWIPE-Util will...

Standalone Use

Here we show the use of the SWIPE vocabulary in a stand-alone RDF description. It can also be used as an extension module for use with RSS and Dublin Core applications.

(meerkat.swp)
<rdf:RDF xml:lang="en"
             xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
             xmlns:rss="http://purl.org/rss/1.0/"
             xmlns="http://rdfweb.org/2000/01/swipe-ns#">

<rdf:Description rdf:about="http://oreillynet.com/meerkat/">
<dc:title>Meerkat: An Open Wire Service</dc:title>
<dc:description>
  Meerkat is a Web-based syndicated content reader 
  providing a simple interface to RSS stories.
</dc:description>
<dc:creator>Rael Dornfest</dc:creator>
<dc:publisher>The O'Reilly Network, O'Reilly &amp; Associates, Inc.</dc:publisher>

<swiper>
  <BasicSpec rdf:about="" method="GET">
  <action  rdf:resource="http://oreillynet.com/meerkat/sherlock"/>
  <macfile rdf:resource="http://oreillynet.com/meerkat/etc/sherlock/meerkat.sit"/>

  <!-- the RSS syndication vocabulary tells us how often to refresh the data -->
  <sy:updatePeriod>daily</sy:updatePeriod>
  <sy:updateFrequency>7</sy:updateFrequency>
  <sy:updateBase>2001-01-01T12:00+00:00</sy:updateBase>

  <!-- todo: banner image / text /link, use rss and util vocabs -->  

  <!-- incoming data needed by web service -->
  <in>
   <rdf:Seq>
   <rdf:li><Input name="t" content="7DAY"/></rdf:li>
   <rdf:li><Input name="_fl" content="sherlock"/></rdf:li>
   <rdf:li><UserInput name="s"/></rdf:li>
   </rdf:Seq>
  </in>


  <!-- interpretation rules for output from web service -->
  <out>
   <ParseRules
    resultListStart="&lt;meerkat&gt;"
    resultListEnd="&lt;/meerkat&gt;"
    resultItemStart="&lt;story&gt;"
    resultItemEnd="&lt;/story&gt;">
    <!-- here we use a simple text-match approach -->
    <parseformat rdf:resource="http://www.apple.com/sherlock/"/>
   </ParseRules>
  </out>
  <!-- XSLT and other output format handlers would be listed here -->
 </BasicSpec>
 </swiper>

</rdf:Description>
</rdf:RDF>
  

Figure 1: Meerkat.swp.gif

RDFViz diagram
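
A tool consuming such a description needs to pull out the action, the inputs and the parse rules. The sketch below does this with Python's standard XML parser rather than a full RDF API, so it depends on this particular XML layout and is only an illustration. It works on a trimmed copy of the description (container members written as rdf:li); the function name read_spec and the shape of the returned dictionary are assumptions.

```python
import xml.etree.ElementTree as ET

SWIPE = "http://rdfweb.org/2000/01/swipe-ns#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the Meerkat description, kept inline so the sketch runs alone.
DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://rdfweb.org/2000/01/swipe-ns#">
<rdf:Description rdf:about="http://oreillynet.com/meerkat/">
 <swiper>
  <BasicSpec rdf:about="" method="GET">
   <action rdf:resource="http://oreillynet.com/meerkat/sherlock"/>
   <in>
    <rdf:Seq>
     <rdf:li><Input name="t" content="7DAY"/></rdf:li>
     <rdf:li><Input name="_fl" content="sherlock"/></rdf:li>
     <rdf:li><UserInput name="s"/></rdf:li>
    </rdf:Seq>
   </in>
   <out>
    <ParseRules resultListStart="&lt;meerkat&gt;"
                resultListEnd="&lt;/meerkat&gt;"
                resultItemStart="&lt;story&gt;"
                resultItemEnd="&lt;/story&gt;"/>
   </out>
  </BasicSpec>
 </swiper>
</rdf:Description>
</rdf:RDF>"""


def read_spec(xml_text):
    """Collect action, method, inputs and parse rules from a BasicSpec."""
    root = ET.fromstring(xml_text)
    spec = root.find(".//{%s}BasicSpec" % SWIPE)
    action = spec.find("{%s}action" % SWIPE).get("{%s}resource" % RDF)
    inputs = []
    seq = spec.find(".//{%s}Seq" % RDF)
    for member in seq:            # each container member element
        for item in member:       # the Input or UserInput inside it
            inputs.append((item.get("name"), item.get("content")))
    rules = dict(spec.find(".//{%s}ParseRules" % SWIPE).attrib)
    return {"action": action, "method": spec.get("method"),
            "inputs": inputs, "rules": rules}


spec = read_spec(DOC)
print(spec["action"], spec["method"])
print(spec["inputs"])
print(spec["rules"])
```

A generic RDF parser would be the more robust choice, since the same graph can be serialised in several different XML forms.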

Using Swipe as an RSS Module

example goes here...

SWIPE and Sherlock

SWIPE descriptions can be used to create Sherlock channels compatible with the Apple 'Sherlock' plugin format [MacSherlock]. Conversely, SWIPE can provide an open, modular and extensible representation for the metadata encoded within Sherlock plugins. RDF-capable browsers such as Mozilla (and Netscape 6.0) that implement a Sherlock-like search system can use RDF datasources to interchange search service descriptions. Similarly, online services such as Sherch which understand the Sherlock plugin format will be able to exploit SWIPE descriptions supplied via RSS syndication.
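
As a rough illustration of the SWIPE-to-Sherlock direction, the sketch below writes plugin-style text from values taken from a SWIPE description. The SEARCH, INPUT and INTERPRET layout used here is an assumption based on how Sherlock plugin files are commonly laid out and should be checked against TN1141 [MacSherlock]; the function name to_sherlock is hypothetical, and escaping and the many optional Sherlock attributes are not handled.

```python
RULE_KEYS = ("resultListStart", "resultListEnd",
             "resultItemStart", "resultItemEnd")


def to_sherlock(name, action, method, inputs, rules):
    """Render SWIPE values as Sherlock-plugin-style text (assumed layout)."""
    lines = ["<SEARCH",
             '   name="%s"' % name,
             '   action="%s"' % action,
             '   method="%s"' % method,
             ">"]
    for input_name, content in inputs:
        if content is None:
            # A UserInput: the value comes from the end-user's query.
            lines.append('<INPUT NAME="%s" user>' % input_name)
        else:
            lines.append('<INPUT NAME="%s" VALUE="%s">' % (input_name, content))
    lines.append("<INTERPRET")
    for key in RULE_KEYS:
        if key in rules:
            lines.append('   %s="%s"' % (key, rules[key]))
    lines.append(">")
    lines.append("</SEARCH>")
    return "\n".join(lines)


plugin = to_sherlock(
    "Meerkat: An Open Wire Service",
    "http://oreillynet.com/meerkat/sherlock",
    "GET",
    [("t", "7DAY"), ("_fl", "sherlock"), ("s", None)],
    {"resultListStart": "<meerkat>", "resultListEnd": "</meerkat>",
     "resultItemStart": "<story>", "resultItemEnd": "</story>"},
)
print(plugin)
```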

SWIPE and Mozilla

Mozilla, the open source browser and Web application toolkit, makes heavy use of RDF and XML. The Mozilla RDF documentation ([MozillaRDF]) provides more information on the Mozilla RDF APIs and RDF-based services than can be presented here. In particular, see the Mozilla search documentation [MozillaSearch] for details of the Sherlock-compatible search tool built into Mozilla.

Excerpt...

The core search functionality in Mozilla is a XPCOM component which uses RDF as its data store, Necko for networking support, and XUL/CSS & JavaScript for its user interface, with a bit of XPConnect for "glue" support [...] Mozilla currently supports version 1.3 of the "Sherlock 2" specification with the exception of "LDAP" support.

References

[P2PMeta]
The Power of Metadata, by Rael Dornfest and Dan Brickley. Openp2p.com, 2001-01-18
[ExtRSS]
RDF: Extending and Querying RSS channels, Dan Brickley and Libby Miller. ILRT discussion document, 2000-11-01
[MacSherlock]
Apple - Mac OS - Sherlock. See online documentation and TechNotes from Apple, particularly Technical Note TN1141, Extending and Controlling Sherlock
[MozillaRDF]
RDF in Mozilla documentation
[MozillaSearch]
The Search for Mozilla, by Robert John Churchill (rjc@netscape.com).