July 10, 2003

Identifying things in FOAF

There is growing interest in FOAF and its relationship to various approaches to "identity management" on the Internet. The FOAF approach to all this is distinctly pluralistic, to the extent that you might not even notice that there is a FOAF way of dealing with identity. There aren't, for example, 'FOAF identifiers' as such, although there is certainly a FOAF approach to identifying things. So this is a first cut at writing up some of the as-yet-unarticulated design assumptions behind FOAF. A more user-friendly version would have examples; those will have to come later.

So here's the basic story. FOAF is built on top of W3C's Resource Description Framework (RDF), which itself uses XML and Unicode as file format standards. All FOAF documents are RDF documents, and any RDF application vocabularies (such as Dublin Core, RSS 1.0 core + extensions, MusicBrainz, Wordnet etc.) can be used within FOAF documents. FOAF shares with RDF a concern to use standard Web identifiers (URIs) wherever possible. The URI specification (RFC 2396) provides a common syntax for naming things on the Web, providing an umbrella concept which covers both 'URLs' and 'URNs'.

To the extent that everything we want to talk about has a well known URI, this solves all our problems. Lots and lots of things that we want to talk about do have URIs. There are URIs for Web pages, for mailboxes, for Java classes, for telephones, for ISBN-registered publications, and so on. This is great - when you want to talk about one of these things in a FOAF file, you just mention its URI. Simple, decentralised, standard.

However, our story doesn't end here: FOAF needs to play in a world where we don't all have total knowledge of every relevant fact. Sometimes a thing might 'have' a URI (in some pedantic sense) yet 99% of parties on the Web might not know what that URI is. Or, closer to my main theme, we might want to talk in our FOAF files about things that it has proved peculiarly difficult to get agreement about identifying. People, for example.

Just try setting up a planet-wide system for identifying people and you'll see my point. There is significant resistance to the idea of creating a single set of identifiers used to 'tag' everyone. To put it mildly. So... where does this leave FOAF? FOAF documents are scattered around the Web, and each document makes a unique contribution to a bigger picture which can only be seen when those documents are merged together. In FOAF, we need to identify people, without there being agreement on person-identifiers. Tricky!

So here is the good news. RDF was designed for generic, cross-domain data merging. Imagine taking two arbitrary SQL databases and merging them, so that your new database could answer questions which required knowledge of things which were previously described partially in one dataset, and partially in another. That sort of operation is hard to do, because SQL wasn't designed in a way that makes this easy. Neither was XML. But RDF was, and FOAF is built as an RDF application. In RDF, there are off the shelf software tools which can take RDF documents, 'parse' them into a set of simple 3-part statements (triples) which make claims about the world, and store those statements alongside others in a merged RDF database. To the extent that both datasets use the exact same identifiers when mentioning things they describe, you get a rather handy data-merge effect.
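To make the data-merge effect concrete, here is a minimal sketch in Python that treats each document as a plain set of triples. This is not how any particular RDF toolkit is implemented; the example.org URIs, the name 'Alice Example' and the helper creator_names are all made up for illustration.

```python
# Each statement is a (subject, property, value) triple.
# The example.org URIs and the name are illustrative, not from real FOAF files.
DC_CREATOR = "http://purl.org/dc/elements/1.1/creator"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Document A knows who created the essay, but not that person's name.
doc_a = {
    ("http://example.org/essay", DC_CREATOR, "http://example.org/people#alice"),
}
# Document B knows the person's name, but nothing about the essay.
doc_b = {
    ("http://example.org/people#alice", FOAF_NAME, "Alice Example"),
}

# Because both documents use the exact same identifier for the person,
# merging is nothing more than a set union.
merged = doc_a | doc_b


def creator_names(triples, work):
    """Return the names of the creators of `work` (hypothetical helper)."""
    creators = {o for s, p, o in triples if s == work and p == DC_CREATOR}
    return sorted(o for s, p, o in triples if s in creators and p == FOAF_NAME)


# Neither document can answer this question alone; the merged set can.
print(creator_names(doc_a, "http://example.org/essay"))
print(creator_names(merged, "http://example.org/essay"))
```

The query only succeeds because the two documents agree on the person's URI, which is exactly the condition that fails so often for people.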

So here is the (not very) bad news. If two different RDF files (eg. FOAF documents) are talking about the same thing but don't use exactly the same URI when mentioning that thing, how are our poor stupid computers supposed to be able to understand? In the real world, we want to write RDF documents (eg. for FOAF) about things that we've not yet agreed on common identifiers for. This is one of the core problems we've had to address in FOAF.

Basically, off the shelf RDF tools can still do a lot to help us, but we have to help them. FOAF, as an application that focusses on the distributed, decentralised, almost out of control use of RDF 'in the wild', ran into this problem after we had about half a dozen FOAF files. There are now hundreds, soon thousands, of FOAF documents. Most of them talk about people, quite successfully, despite the absence of a global person-id registry. This sounds like a recipe for chaos, yet somehow many of our FOAF aggregation tools are quite happy with this situation. They can often figure out when two files are about the self-same thing, without much help from the authors of those documents. We do this using what might be called "reference by description". Instead of saying, "this page was created by urn:global-person-registry:person-n22314151", we say "this page was created by the person whose (some-property...) is (some-value...)", taking care to use an unambiguous property such as foaf:homepage or foaf:mbox_sha1sum.

Here's how it works. Recall that FOAF is built on top of RDF, and so every FOAF document boils down to nothing more than a set of 3-part statements which relate two things together via terms such as 'workplaceHomepage', 'homepage', 'mbox'.

I am related to those things that are my homepages; FOAF's name for that relationship is 'foaf:homepage'.

I am related to those things that are my personal mailboxes by a relationship FOAF calls 'foaf:mbox'.

I am related to the strings that you get from feeding my mailbox identifiers to the SHA1 mathematical function by a relationship FOAF calls 'foaf:mbox_sha1sum'.

I am related to a Myers-Briggs personality classification; FOAF calls that relationship 'foaf:myersBriggs'.

I am related to my workplace homepage (http://www.w3.org/) by a relationship called -- you guessed it -- 'foaf:workplaceHomepage'.

I am related to my name, 'Dan Brickley' by the 'foaf:name' relationship.

I am related to my AIM chat identifier by a relationship FOAF calls 'foaf:aimChatID'.

And so on. Other RDF vocabularies can define additional relationships (see the FoafVocab entry in our wiki for pointers). They all relate things to other things in named ways. A FOAF document, like any RDF document, is simply a collection of these simple claims about how things in the world relate.
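As a rough illustration, the claims listed above could be written down as triples like this. It is a sketch only: the label "_:me" is a local placeholder rather than a global identifier, the helper mbox_sha1sum is hypothetical, and I am assuming the 'mailbox identifier' fed to SHA1 is the full mailto: URI.

```python
import hashlib

FOAF = "http://xmlns.com/foaf/0.1/"


def mbox_sha1sum(mbox_uri):
    """Hex SHA1 digest of a mailbox identifier.

    Assumption: the identifier is the full 'mailto:' URI.
    """
    return hashlib.sha1(mbox_uri.encode("utf-8")).hexdigest()


# "_:me" is a document-local placeholder for "the person being described".
me = "_:me"
mbox = "mailto:danbri@rdfweb.org"

claims = [
    (me, FOAF + "name", "Dan Brickley"),
    (me, FOAF + "homepage", "http://rdfweb.org/people/danbri/"),
    (me, FOAF + "workplaceHomepage", "http://www.w3.org/"),
    (me, FOAF + "mbox", mbox),
    (me, FOAF + "mbox_sha1sum", mbox_sha1sum(mbox)),
]

for subject, prop, value in claims:
    print(subject, prop, value)
```

Note that nothing here names the person with a URI; the document only says how some unnamed thing relates to a name, some pages and a mailbox.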

But look again. There is a hidden pattern here. Some of these relationships are special.

foaf:homepage, foaf:mbox, foaf:mbox_sha1sum and foaf:aimChatID fall in one category.

foaf:workplaceHomepage, foaf:myersBriggs, foaf:name fall in another.

Here's the difference. The former kinds of relationship (or 'property' in RDF-talk) have a special characteristic. They have been defined such that there is at most one thing in the world that has any particular value for that property.

There is... at most one thing in the world with any given foaf:homepage. Or foaf:mbox, or foaf:mbox_sha1sum, or foaf:aimChatID. By contrast, there may well be multiple things in the world with the same foaf:workplaceHomepage, or foaf:myersBriggs, or even (it's a big world) foaf:name. Apparently there's another Dan Brickley out there. And lots of my colleagues share my workplace homepage. And there are a lot of people whom Myers-Briggs surveys classify as 'INTP'. But there is nobody else at all who has the same foaf:homepage as me, or the same foaf:mbox. Or foaf:aimChatID.

This is one of the design principles underlying FOAF (and for that matter the entire Semantic Web effort): a pragmatic, pluralistic approach to resource description and identification. Rather than building big, centralised registries of people (or companies, or physical things) we look for cheaper, more lightweight shared strategies for identification. In FOAF, we do this by making sure there are multiple ways we can identify things.

So one FOAF file might mention 'here is a photo; it depicts the person whose mailbox is danbri@rdfweb.org'. Another FOAF file might say 'here is a weblog entry written by the person whose homepage is http://rdfweb.org/people/danbri/'. A third FOAF file might say 'here is a chat transcript by the person whose foaf:aimChatID is danbri_2002'. To the extent that there is publicly readable RDF in the Web that makes all these claims, and that there is, perhaps scattered around, enough information to deduce that these all describe the same person, RDF/FOAF tools can 'smush' it all together. They could 'realise' that the photo and the weblog and the chat log were all associated with the self-same thing, i.e. me.

To do that, we need certain pieces of information. We need to know which, of all the kinds of relationship there are, are the uniquely identifying ones. In RDF terminology we call these unambiguous (or more technically, inverse-functional) properties. When RDF software reads the FOAF spec it can determine this from markup embedded in the document itself. So machines can find out quite easily which properties are ones which uniquely identify people. They can do this for the FOAF spec, and for any other RDF vocabulary that is used alongside FOAF.
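Here is a small sketch of that lookup. It assumes the vocabulary marks its identifying properties with an rdf:type of owl:InverseFunctionalProperty, and the hand-written schema_triples below merely stand in for whatever a real RDF parser would extract from a spec; they are not copied from the FOAF schema.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDF_PROPERTY = "http://www.w3.org/1999/02/22-rdf-syntax-ns#Property"
OWL_IFP = "http://www.w3.org/2002/07/owl#InverseFunctionalProperty"
FOAF = "http://xmlns.com/foaf/0.1/"

# Stand-in for triples parsed from a vocabulary description (illustrative only).
schema_triples = {
    (FOAF + "homepage", RDF_TYPE, OWL_IFP),
    (FOAF + "mbox", RDF_TYPE, OWL_IFP),
    (FOAF + "name", RDF_TYPE, RDF_PROPERTY),
    (FOAF + "workplaceHomepage", RDF_TYPE, RDF_PROPERTY),
}


def identifying_properties(triples):
    """Return the properties declared inverse-functional (hypothetical helper)."""
    return {s for s, p, o in triples if p == RDF_TYPE and o == OWL_IFP}


print(sorted(identifying_properties(schema_triples)))
```

The same function works unchanged on triples from any other vocabulary, which is why tools need no FOAF-specific knowledge to discover which properties identify things.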

The other bit of information needed is that somewhere in the Web, it would need to be claimed that there is a person who has a mailbox of ... and a homepage of ... and an aimChatID of ...

If that information is available, then FOAF tools are all set to do the data merge, even though there is no planet-wide unified identification system for people. We don't use anything else except off the shelf standards: URIs plus W3C RDF and OWL technology.
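The merge itself can be sketched in a few lines. This is one simple way to do it (a union-find over placeholder labels), not a description of any existing FOAF aggregator. The labels "_:a" to "_:e", the photo URL and the second homepage are invented for the example, and the set of identifying properties is hard-coded rather than read from a schema.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
IFPS = {FOAF + "homepage", FOAF + "mbox", FOAF + "aimChatID"}


def smush(triples, ifps):
    """Group subjects that share a value for any identifying property."""
    parent = {}

    def find(x):
        # Find the representative of x's group, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # (property, value) -> first subject seen with that pair
    for s, p, o in triples:
        if p not in ifps:
            continue
        find(s)
        key = (p, o)
        if key in first_seen:
            union(first_seen[key], s)
        else:
            first_seen[key] = s

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())


triples = [
    # File 1: a photo of the person with this mailbox.
    ("http://example.org/photo.jpg", FOAF + "depicts", "_:a"),
    ("_:a", FOAF + "mbox", "mailto:danbri@rdfweb.org"),
    # File 2: the person with this homepage.
    ("_:b", FOAF + "homepage", "http://rdfweb.org/people/danbri/"),
    # File 3: the person with this chat ID.
    ("_:c", FOAF + "aimChatID", "danbri_2002"),
    # Somewhere else: one description tying all three together.
    ("_:d", FOAF + "mbox", "mailto:danbri@rdfweb.org"),
    ("_:d", FOAF + "homepage", "http://rdfweb.org/people/danbri/"),
    ("_:d", FOAF + "aimChatID", "danbri_2002"),
    # An unrelated person.
    ("_:e", FOAF + "homepage", "http://example.org/~someone/"),
]

for group in smush(triples, IFPS):
    print(sorted(group))
```

Without the linking description "_:d", the first three placeholders would stay in separate groups: the tools can only merge what somebody, somewhere, has actually claimed.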

If you find the data merging potential creepy, you are not alone. This kind of technology is not going away, but there are steps you can take. A full discussion of the privacy aspect isn't possible here, but the basic idea is (i) be aware -- scattered information can easily be merged (ii) keep things as secret as they need to be. Don't tell the world (in your FOAF file or elsewhere) all the chat IDs and homepages and mailboxes that you use, then act surprised when people and machines piece together your scattered contributions to the Web. Reading up on PGP might be a good idea.

We don't need to wait for a global identity management system before privacy and data merging become an issue. FOAF is intended to explore these issues, and to provide some advance warning for the way certain aspects of semantic web technology may affect our lives. Just as the world has had to adapt to the notion of 'being Googled' and having things that once seemed obscure now all too easily found, the rise of semantic web technology needs to be accompanied by an understanding of the risks and opportunities that 'being identified' presents.

Finally... a couple of points of further reading on the technical rather than social side of this problem. A couple of years ago I wrote a brief note on aggregation strategies which describes the 'smushing' problem. A more recent writeup by Matt Biddulph describing his Java implementation is worth a read too, as are many of the documents from the TAP project, which share FOAF's concern for reference-by-description. Guha and Rob's overview paper sets out the issues very clearly.

Posted by danbri at July 10, 2003 12:05 PM
Comments

Short version:

In FOAF, we use URIs to identify things while describing them.

When we don't have URIs handy, we take care to use identifying properties in our descriptions.

We don't care which properties we use, so long as they are unambiguous. The more that get into general use, the easier it becomes to figure out when two documents are describing the same entity.

Posted by: Dan Brickley on July 10, 2003 01:36 PM

Coolio dude - thanks for this. But the process of smushing is kind of 'challenging' when the end-user holds onto their own foaf.rdf file.

So we have this idea of: a) hosting a database of foaf.rdf files that get 'shared' for FOAFster-type functionality (if end-users don't care) and b) if they DO care about holding onto their own foaf.rdf files - then we assume that their file is the master, and we'll set up a mirroring process - where any changes to the master are uploaded to the 'shared' database.

How does that sound?

Posted by: Marc Canter on July 10, 2003 04:12 PM

danbri, nice work.

Marc, I'm not sure the users holding their own files is likely to be such an issue. But to do stuff with the data you'll have to read it into your own system anyway. You could just store the URIs and poll them whenever needed, but it would probably make sense in terms of performance to keep a version in your DB as a cache. This should also lend itself to pretty simple load distribution - it doesn't really matter where the FOAF statements are held. So I think what you're suggesting probably would be a good approach, but for a slightly different reason ;-)

I know there's plenty of work being done with the query languages, but maybe systems like this call for a very simple standardised subset plus some system-level comms. i.e. just two or three fairly FOAF-specific queries (e.g. give me all your Persons with any of these properties) combined with URIs for the stores. Make it as easy as possible to implement, but allow sync and the forwarding of queries between DBs (that may even have completely different purposes, but still expose the same mini-interface).

Posted by: Danny on July 11, 2003 01:17 AM

Great, and really illustrative example, Dan.

Marc and Danny: I am thinking about similar things, which I see as:

1. another level of DNS-like mapping to specific documents that gives those documents non-volatile identifiers

2. document stores that automatically plug-in to that DNS-like mapping, and provide interfaces exposing the stored documents to queries

3. tools for querying, caching, and syncing document data across different document stores

(4. safe and easy to integrate with existing/other tools, like websites, blogs, wikis, and email, and with other standards like N/echo and RSS!)

Posted by: Jay Fienberg on July 13, 2003 12:02 AM