Thursday, August 30, 2007

RDF, Document Repositories

I have recently had several discussions with people working on or with various document repository projects. One thing I keep hearing is a growing appreciation of the importance of semantics in repository metadata management. I'm naturally pleased to see this, because storage, query, and management of semantics is precisely where projects like mulgara come in.

Now, digital repositories are a lot more than just a semantic store; there are also issues associated with the actual document storage and retrieval, and with the QA, workflow, and various metadata extraction tasks. However, these days most repository projects include some sort of semantic store to manage metadata, and the question is reasonably asked: why not just use that?

The reason I recommend considering mulgara to augment a document repository is the additional flexibility gained by doing so. When repository developers approach the metadata problem, they have a natural tendency to adopt an 'instance-focused' view of metadata. This is where the focus is on the 'document', where each document is of a given 'document type', and each 'document type' implies corresponding 'attributes'. In contrast, RDF is 'property-focused': the focus is on defining the 'meaning' of 'properties', and a 'document' having a given property 'classifies' the document as belonging to a given 'class of documents'.
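
To make the property-focused view concrete, here is a minimal sketch in plain Python. It is not mulgara's API; the `ex:` and `doc:` names, the `PROPERTY_DOMAINS` table, and the `classify` helper are all hypothetical, and the table only loosely mimics what `rdfs:domain` expresses in RDFS. The point it illustrates is that no document type is declared up front: a document's classes fall out of the properties it happens to carry.

```python
# Hypothetical vocabulary: using a property implies membership of a class
# (in the spirit of rdfs:domain). Nothing here is mulgara-specific.
PROPERTY_DOMAINS = {
    "ex:abstract": "ex:Article",
    "ex:recordedAt": "ex:Recording",
}

# Metadata held as (subject, property, value) triples rather than typed records.
TRIPLES = [
    ("doc:1", "ex:abstract", "Notes on repository metadata"),
    ("doc:1", "ex:recordedAt", "Brisbane"),
    ("doc:2", "ex:recordedAt", "Sydney"),
]


def classify(triples, domains):
    """Infer each subject's classes from the properties it uses."""
    classes = {}
    for subject, prop, _value in triples:
        if prop in domains:
            classes.setdefault(subject, set()).add(domains[prop])
    return classes


CLASSES = classify(TRIPLES, PROPERTY_DOMAINS)
```

Note that `doc:1` ends up in two classes at once, something a single-inheritance document-type hierarchy would have to be redesigned to allow.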

While the two are duals, they do influence design. If you are taking an instance-focused approach you will find yourself heading towards document-type hierarchies and defined attribute lists. If you take a property-focused approach you will find yourself defining lexicons and ontologies. The former tends towards relational approaches, the latter towards semantic approaches such as RDF. I believe the semantic approach to be superior because of its flexibility. Hierarchies are predicated on centralised control. Even if you can maintain central control over the document type definitions and their attributes, the very act of standardisation in this manner leaves your project with the unfortunate choice between 'scale' and 'responsiveness to change'.

RDF and its related standards allow the decentralisation of attribute/property definition, while providing the tools to manage the resulting 'mess'. With the combination of namespaces and RDFS, and assuming the provision of schema repositories, it becomes possible to allow global use and reuse of locally defined 'models'. This is especially relevant when you consider that 'local' should be considered in both temporal and spatial senses, keeping the system responsive both to the differing needs of independent organisations and to the changing requirements of a single entity.
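
As a rough illustration of how namespaces and RDFS let a locally defined term be reused globally, the sketch below declares a local property as an `rdfs:subPropertyOf` the Dublin Core `creator` element, then answers a query phrased purely in Dublin Core terms. The Dublin Core and RDFS namespace URIs are the real ones; the `LOCAL` namespace, the sample data, and the helper functions are assumptions made up for the example, and a real store would do this inferencing for you rather than in application code.

```python
DC = "http://purl.org/dc/elements/1.1/"
RDFS_SUBPROPERTY_OF = "http://www.w3.org/2000/01/rdf-schema#subPropertyOf"
LOCAL = "http://example.org/music-library/terms#"  # hypothetical local namespace

# Locally published schema: local terms are tied back to shared ones.
SCHEMA = [
    (LOCAL + "composer", RDFS_SUBPROPERTY_OF, DC + "creator"),
    (LOCAL + "arranger", RDFS_SUBPROPERTY_OF, DC + "contributor"),
]

# Metadata from two sources, one using the local vocabulary.
DATA = [
    ("doc:score-17", LOCAL + "composer", "Composer A"),
    ("doc:score-17", LOCAL + "arranger", "Arranger B"),
    ("doc:report-3", DC + "creator", "Author C"),
]


def superproperties(prop, schema):
    """Return prop plus every property it is a (transitive) subproperty of."""
    found = {prop}
    frontier = [prop]
    while frontier:
        current = frontier.pop()
        for s, p, o in schema:
            if s == current and p == RDFS_SUBPROPERTY_OF and o not in found:
                found.add(o)
                frontier.append(o)
    return found


def query(prop, data, schema):
    """Find (subject, value) pairs stated with prop or any of its subproperties."""
    return sorted((s, v) for s, p, v in data if prop in superproperties(p, schema))


CREATORS = query(DC + "creator", DATA, SCHEMA)
```

An application that knows nothing about the music library's vocabulary still finds the score when it asks for `creator`, because the namespace keeps the local term unambiguous and the schema says how it relates to the shared one.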

The result is the need for two additional boxes in your architecture diagram. The first is a vocabulary server that allows users to define their own vocabulary extensions, makes those definitions available to applications, and, critically, keeps them manageable and open to eventual rationalisation. The second is a metadata server that can store the resulting data and support both ad-hoc querying by applications and the inferencing required by vocabulary rationalisation. One of the reasons mulgara exists is to provide the storage and query engine required by these components, and so I do enjoy having the chance to talk to people about it.
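
For what it's worth, the two boxes can be sketched as a pair of toy in-memory classes. This is only an illustration of the division of responsibilities under my own assumptions, not how mulgara or any particular repository implements it; the class names, method names, and the `lab:`/`dept:` terms are all hypothetical.

```python
class VocabularyServer:
    """Holds user-defined terms and records later rationalisation decisions."""

    def __init__(self):
        self.definitions = {}  # term -> human-readable definition
        self.replaced_by = {}  # term -> preferred term after rationalisation

    def define(self, term, definition):
        self.definitions[term] = definition

    def rationalise(self, old_term, preferred_term):
        self.replaced_by[old_term] = preferred_term

    def canonical(self, term):
        """Follow rationalisation links to the preferred term (cycle-safe)."""
        seen = set()
        while term in self.replaced_by and term not in seen:
            seen.add(term)
            term = self.replaced_by[term]
        return term


class MetadataServer:
    """Stores triples and answers ad-hoc pattern queries; None is a wildcard."""

    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        self.triples = set()

    def add(self, subject, prop, value):
        self.triples.add((subject, prop, value))

    def match(self, subject=None, prop=None, value=None):
        want = None if prop is None else self.vocabulary.canonical(prop)
        results = []
        for s, p, v in self.triples:
            if subject is not None and s != subject:
                continue
            if want is not None and self.vocabulary.canonical(p) != want:
                continue
            if value is not None and v != value:
                continue
            results.append((s, p, v))
        return sorted(results)


vocab = VocabularyServer()
vocab.define("lab:writer", "Person who wrote the document (lab usage)")
vocab.define("dept:author", "Person who wrote the document (department usage)")
vocab.rationalise("lab:writer", "dept:author")  # decided after the fact

store = MetadataServer(vocab)
store.add("doc:1", "lab:writer", "Kim")
store.add("doc:2", "dept:author", "Lee")
store.add("doc:2", "dept:title", "Notes")

AUTHORED = store.match(prop="dept:author")
```

The data recorded under `lab:writer` is never rewritten; the rationalisation lives in the vocabulary server and is applied at query time, which is what lets the two groups keep working independently until someone decides the terms mean the same thing.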