This is a quick writeup of an application of RDF query tools to the problems of weblog discovery and filtering. It shows off RDF's ability to merge XML-encoded data from multiple sources, answering queries which couldn't be answered by considering each source separately.
Imagine you are looking for weblog URLs (and perhaps their associated RSS feeds) of people who have written Perl software that is in CPAN. At our disposal are two data sources. The first is a data dump listing the SHA1 hashes of CPAN author email addresses: for each author, we have an @cpan address, plus one of their normal addresses (cpan2foaf.pl). The second, also RDF/XML, is a document of FOAF data listing other information about several people who happen to write for CPAN. That file is a stand-in for their real FOAF files, which would typically be collected from each of their sites individually. For simplicity, we show that data as a single document.
The first datasource mentions several hundred people who contribute to CPAN, along with enough information to identify them (by hash of email address). The second datasource contains some of the same identifiers (their hashed home email address) plus, more interestingly, information about their names, homepages, weblogs and suchlike. How can we get answers to questions like "find me weblog addresses for CPAN contributors"?
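As a rough illustration of what "hash of email address" means here: FOAF's mbox_sha1sum property holds the SHA1 of a mailbox's full mailto: URI. Assuming both sources follow that convention (the helper name and the example.org addresses are made up for illustration), the hashing might look like this in Ruby:

```ruby
require 'digest/sha1'

# Hash a mailbox in the style of FOAF's mbox_sha1sum: SHA1 over the
# full "mailto:" URI, written as 40 lowercase hex digits.
# The helper name and the addresses below are illustrative only.
def mbox_sha1sum(email)
  Digest::SHA1.hexdigest("mailto:#{email}")
end

puts mbox_sha1sum('alice@example.org')
puts mbox_sha1sum('alice@cpan.example.org')
```

Because the hash is deterministic, two sources that hash the same address end up with the same identifier, without either of them publishing the address itself.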
So... we load the RDF up into an (SQL-backed) RDF database, make sure to merge the data (more on which another time), and it is available for query.
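The merge step can be sketched roughly as follows. This assumes it works by treating a shared email hash as evidence that two descriptions are about the same person; the record layout, the hash values and the merge_by_hash helper are all invented for illustration and are not the database's actual code:

```ruby
require 'set'

# Hypothetical records: each description carries the email hashes that
# identify a person, plus whatever else that source knows about them.
CPAN_DUMP = [
  { hashes: Set['aaa111', 'bbb222'], props: { contributor_to: 'http://www.cpan.org/' } },
  { hashes: Set['ccc333', 'ddd444'], props: { contributor_to: 'http://www.cpan.org/' } }
].freeze

FOAF_DOC = [
  { hashes: Set['bbb222'], props: { name: 'Alice Example', weblog: 'http://alice.example.org/blog' } }
].freeze

# Merge descriptions that share at least one hash, treating the hash as
# an identifying property. One pass is enough for this toy data; a
# general merge would repeat until no two records share a hash.
def merge_by_hash(records)
  records.each_with_object([]) do |rec, merged|
    match = merged.find { |m| m[:hashes].intersect?(rec[:hashes]) }
    if match
      match[:hashes].merge(rec[:hashes])
      match[:props].merge!(rec[:props])
    else
      merged << { hashes: rec[:hashes].dup, props: rec[:props].dup }
    end
  end
end

people = merge_by_hash(CPAN_DUMP + FOAF_DOC)
people.each { |person| puts person[:props].inspect }
```

After merging, the record for the made-up "Alice Example" carries both the CPAN contributor fact from the first source and the name and weblog from the second, which is what lets a single query span both.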
Here's the SquishSQL query we send to the RDF software:
SELECT ?name, ?homepage, ?weblog, ?x, ?feed
WHERE (foaf::name ?x ?name)
(dc::contributor ?x http://www.cpan.org/)
(foaf::homepage ?x ?homepage)
(foaf::weblog ?x ?weblog)
(foaf::rssfeed ?weblog ?feed)
USING dc for http://purl.org/dc/elements/1.1/
foaf for http://xmlns.com/foaf/0.1/
This RDF query language is called Squish as it is loosely SQL-ish in the way it is used. We send the database a question asking for a selection of fields, and we get back a table of results. Here are name/homepage/weblog 'hits' corresponding to the rows returned by the query system:
Earle Martin (http://downlode.org/ weblog: http://downlode.org/blog.pl)
Jo Walsh (http://www.zooleika.org.uk weblog: http://www.zooleika.org.uk/blog.html)
Norm Walsh (http://nwalsh.com/ weblog: http://norman.walsh.name/)
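To give a feel for how a query like this turns into rows, here is a toy Ruby sketch. It is not the actual RDF database or its lookup script: it matches a few of the query's clauses against a small in-memory list of invented triples, binding each ?variable consistently across clauses. The data, the prefixed names and the unify and query helpers are all assumptions for illustration:

```ruby
# Toy triples as [predicate, subject, object], mirroring the clause
# order used in the query. All data here is invented.
TRIPLES = [
  ['foaf:name',      '_:p1', 'Alice Example'],
  ['dc:contributor', '_:p1', 'http://www.cpan.org/'],
  ['foaf:weblog',    '_:p1', 'http://alice.example.org/blog'],
  ['foaf:name',      '_:p2', 'Bob Example'],
  ['foaf:weblog',    '_:p2', 'http://bob.example.org/blog']
].freeze

PATTERNS = [
  ['foaf:name',      '?x', '?name'],
  ['dc:contributor', '?x', 'http://www.cpan.org/'],
  ['foaf:weblog',    '?x', '?weblog']
].freeze

# Match one pattern against one triple under the existing bindings.
# Terms starting with "?" are variables. Returns the extended bindings,
# or nil if the triple does not fit.
def unify(pattern, triple, bindings)
  result = bindings.dup
  pattern.zip(triple).each do |term, value|
    if term.start_with?('?')
      return nil if result.key?(term) && result[term] != value
      result[term] = value
    elsif term != value
      return nil
    end
  end
  result
end

# Evaluate the patterns in order, keeping every consistent set of bindings.
def query(patterns, triples)
  patterns.reduce([{}]) do |rows, pattern|
    rows.flat_map { |b| triples.map { |t| unify(pattern, t, b) }.compact }
  end
end

query(PATTERNS, TRIPLES).each do |row|
  puts "#{row['?name']} weblog: #{row['?weblog']}"
end
```

Only the made-up "Alice Example" satisfies every clause, so only that row comes back; "Bob Example" has a weblog but no CPAN contributor triple and drops out of the join.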
Here is a simple Ruby lookup script which extracts the tabular result set from the database. The query extracts information (in this case weblog address etc.) about people who contribute to CPAN and are known to the database.
Bugs and disclaimers:
I used the dc:contributor property incorrectly in this demo. It should relate a document to one of its contributors. I used it backwards, and slightly incorrectly, by relating a person to a homepage of a project they contribute to. I should use foaf:project instead.
The demo links are also a bit inconsistent, in that I added in a clause to the query which also asks for the RSS feed of a weblog, but didn't commit the new resultset output to the website yet. I'll try to get that fixed.
This brief writeup probably assumes too much background knowledge for those new to RDF to follow in detail, and things like SHA1 'hashing' could do with elaboration. Still, I hope the basic approach is clear: RDF allows us to merge disparate datasources and get back answers that depend on their combination, and we can use that for filtering and discovery of weblogs, given a description of some weblogs and their authors.
Posted by danbri at June 23, 2003 02:21 PM