Thursday, November 29, 2007

How to start thinking about RDF.

While trying to understand RDF, its capabilities and its limitations, I have done a lot of reading of earlier work on semi-structured data management. It occurs to me that I haven't blogged a paragraph from a 1995 paper on heterogeneous information sources that really crystallised for me the crucial difference between RDF and previous attempts at semi-structured data (especially XML and friends) - better late than never I suppose...

We need not define in advance the structure of an object ... no notion of a fixed schema or object class. A label can play two roles: identifying an object (component) and identifying the meaning of an object (component). If an information source exports objects with a particular label, then we assume that the source can answer the question "What does this label mean?". It is particularly important to note that labels are relative to the source that exports them. That is, we do not expect labels to be drawn from an ontology shared by all information sources.

Papakonstantinou, Y. et al. "Object Exchange Across Heterogeneous Information Sources", 11th Conference on Data Engineering, IEEE Computer Society, 251-260, 1995.

The actual framework presented in the paper is focused on intraorganisational integration and so doesn't have the property of anarchic scalability required if it is to be applied to the internet - however it does express clearly this need for source-defined semantics if semi-structured data is to be tractable.
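To make the idea of source-relative labels concrete, here is a minimal Python sketch. It is not the paper's framework or any real library's API: the Source, Label and Obj names, the glossary lookup, and the two example sources are all invented for illustration. The only thing it tries to show is that a label is only meaningful together with the source that exports it, and that the source itself can answer "What does this label mean?".

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Label:
    """A label qualified by the source that exports it (hypothetical type)."""
    source: str  # name of the exporting information source
    name: str    # the label as that source spells it


@dataclass
class Obj:
    """A self-describing object: a label plus an atomic value or sub-objects."""
    label: Label
    value: Union[str, int, float, List["Obj"]]


class Source:
    """An information source that defines the meaning of its own labels."""

    def __init__(self, name, glossary):
        self.name = name
        self._glossary = dict(glossary)  # label name -> meaning, per source

    def label(self, name):
        # Labels are always created relative to this source.
        return Label(self.name, name)

    def meaning(self, label):
        # A source only answers for labels it exports itself.
        if label.source != self.name:
            raise KeyError(f"{label.name!r} is not exported by {self.name!r}")
        return self._glossary[label.name]


# Two illustrative sources that both use the word "title", with no shared ontology.
library = Source("library", {"title": "the name of a published work"})
hr = Source("hr", {"title": "an employee's job title"})

book = Obj(library.label("title"), "Moby-Dick")
employee = Obj(hr.label("title"), "Technical Editor")

# Same spelling, different labels: each is relative to its exporting source.
same_label = book.label == employee.label
```

Here same_label is False even though both sources spell the label "title", and each source can be asked for the meaning of its own label only. RDF gets a comparable effect by naming properties with URIs, so the party that mints the name is the one that defines it.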

2 comments:

Josh Robb said...

Hey - how does this work relate conceptually to projects like CouchDB or the new Amazon Dynamo/Simple DB?

(Hi by the way - email me - I want to catch up ;).

Andrae Muys said...

I'm not sure as I haven't looked at either of them - however after a (very) quick google - RDF is similar to tuplespaces (mentioned as the basis for SimpleDB) only with a stronger notion of source-defined semantics, and a couple of semantic differences that permit the anarchic scalability required for use at internet scales.

With only a 1 minute scan to go on, CouchDB looks more like a LotusNotesDB clone than a semi-structured datastore. Specifically, the CouchDB Quick Overview comments: "To address this problem of adding structure back to semi-structured data, CouchDB integrates a view model using Javascript for description." That suggests CouchDB is discarding about 2 decades of lessons and research on how to manage semi-structured data.

Of the two, I would be paying attention to SimpleDB - however I am firmly of the opinion that both are obsoleted by the various RDF datastores now available (including Mulgara of course :).

I would love to catch up - unfortunately I don't currently have an email address for you :(.