Wednesday, November 25, 2009

JRuby and JNI and OS X.

Like practically everyone, I sat down a few years back and picked up enough Ruby to write an address-book app in Rails. Nice enough language, with a few nice features[0]. Still, at the end of the day... YADTL, so ultimately YAWN[1].

All this is to say that I found myself yesterday facing the need to convert a trivial Rails app into a Java servlet, with the added complication that it uses a third-party C++ library to do the bulk of the work.

JRuby is a trivial install, and the application in question didn't make use of continuations, so: rip out all the Rails code, replace it with a simple loop that takes requests from the command line instead of a POST parameter, and we're in business.

Well, up to the point where it falls over because it needs the C++ library. Here's the first real hurdle: Apple's JDK for OS X is compiled as a 64-bit binary, which can't link against 32-bit libraries[2], and by default the C compiler only produces 32-bit libraries.

$ file /usr/local/lib/libcrfpp.dylib 
/usr/local/lib/libcrfpp.dylib: Mach-O dynamically linked shared library i386
A quick check on Google yields these pages, which suggest that I need to add -arch i386 -arch x86_64 to CXXFLAGS and LDFLAGS. So after modifying the Makefile to compile the library with those flags and install it in /opt/local, we have:
$ file /opt/local/lib/libcrfpp.dylib 
/opt/local/lib/libcrfpp.dylib: Mach-O universal binary with 2 architectures
/opt/local/lib/libcrfpp.dylib (for architecture i386): Mach-O dynamically linked shared library i386
/opt/local/lib/libcrfpp.dylib (for architecture x86_64): Mach-O 64-bit dynamically linked shared library x86_64

Then we need to compile the SWIG-generated JNI wrapper, which requires the same arch flags but also needs to link against Apple's JNI installation. Back to Google, which turns up this page from Apple describing how to compile JNI code on Darwin; it supplies the include path and an additional link flag.

The command lines to compile the wrapper in question become:

g++ -arch i386 -arch x86_64 -c -I/System/Library/Frameworks/JavaVM.framework/Headers CRFPP_wrap.cxx
g++ -arch i386 -arch x86_64 -dynamiclib -L/opt/local/lib -lcrfpp -o libCRFPP.jnilib CRFPP_wrap.o -framework JavaVM
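
Before involving JRuby at all, it's worth a quick sanity check from the Java side - a throwaway class of my own (not part of CRF++ or its SWIG bindings) that simply attempts the load. Run it with -Djava.library.path=/opt/local/lib and it will die with an UnsatisfiedLinkError if the library can't be found or the architectures still don't match:

public class LoadCheck {
    public static void main(String[] args) {
        // Resolves libCRFPP.jnilib via java.library.path; throws UnsatisfiedLinkError
        // if the library is missing or built for the wrong architecture.
        System.loadLibrary("CRFPP");
        System.out.println("libCRFPP.jnilib loaded successfully");
    }
}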

Wanting to avoid calling System.loadLibrary from Ruby code, I wrote a quick wrapper class that makes the call in a static initialiser block:

package crfppwrapper;

import org.chasen.crfpp.Tagger;

public class CRFPP {
    public static Tagger newTagger(String args) {
        return new Tagger(args);
    }

    static {
        System.loadLibrary("CRFPP");
    }
}
Then modify the relevant Ruby file to load the jar file:
require 'java'
require 'lib/CRFPP-0.53.jar'
and call the wrapper:
crfppwrapper.CRFPP.newTagger("...");

Finally, modify the top-level JRuby script to pass the necessary parameters to the JVM:

#!/usr/bin/env jruby -J-Djava.library.path=/opt/local/lib -J-cp .:lib/CRFPP-0.53.jar -w

And it worked. No more Rails; migrated to JRuby; and ready to bundle into a trivial servlet that will reintroduce the POST parameter to the mix.

[0] The implicit closure parameter to every method and the brace syntax for creating the closures themselves do lend themselves to some wonderfully elegant idioms.

[1] Of course if you were coming from J2EE you were probably more than happy to exchange static-typing for configuration-by-convention.

[2] And all of a sudden I'm having flashbacks to the horror of IRIX binary-toolchain management in the late '90s.

Tuesday, September 08, 2009

ldd equivalent under Darwin (OS X)

ldd is an incredibly useful command under Linux. It lists the shared libraries required by an executable, and exactly which .so file each dependency is currently resolved to.

A similarly useful tool is objdump, which permits inspection of object files, especially executables and shared libraries, extracting things like the string table, symbol table, and the various headers and sections.

Under Darwin these tools are combined in otool. ldd can be duplicated by otool -L.

Wednesday, August 26, 2009

Gov2.0 and Open-Source - a response

Cross-posted from Data.gov and lessons from the open-source world at the Gov2.0 Taskforce blog.

My biggest worry is that the government's response to this initiative will be the announcement of some multi-million-dollar grand gesture: a big press conference; the Minister announcing the 'grand vision'; and the possible benefits we could see lost in the maze that is large-scale government procurement.

The key insight of CatB (The Cathedral and the Bazaar) is the extent to which redundancy is a benefit in exploratory development.

For the moment, we have no idea of the correct model for Gov2.0 - we have some understanding of what has worked outside of government, and a few promising avenues of approach, but no actual answers.

So I think we want to recommend that different Agencies experiment with different approaches and that the OIC be tasked with:

  1. Examining the success/failure of the different attempts, and eventually starting to help agencies improve the success rate.
  2. Ensuring that the legal and regulatory requirements for aggregation and interoperability of/between these different services are standardised, as these are the issues that will derail bazaar development.
  3. Acting as a central clearing house where agencies/organisations/individuals can choose to self-publish interface descriptions, custom schemas, metadata element schemes, vocabularies, etc.
  4. Providing a collaboration and mediation service to allow the reconciliation of conflicting interface/schema/scheme/vocab's.

The result would hopefully be a myriad of exploratory projects, some of which would fail, most of which would be ho-hum, but many of which would succeed.

The OIC would act as an institutional memory, learning and recording the lessons learnt; an institutional coordinator, making sure that people wanting to aggregate/integrate the different data-sources aren't forbidden from doing so; and an institutional mediator, assisting the different projects in finding and working together when they would like to.

Please post any comments to the Gov2.0 site listed above, not here.

Monday, July 13, 2009

Google’s new OS could hit Microsoft where it hurts

In a blog post, Andy Goldberg provides a quote that I suspect a lot of Microsoft sycophants will be telling themselves over the next year:

"Google may or may not have the experience and capability of actually producing an operating system and getting it deployed," he said. "It may not realise how hard it is."

Anyone who takes this line should remind themselves: Chrome OS is not an operating system! It is a browser-based windowing system running on top of an open-source OS, and Google can definitely handle that. Moreover, Google has spent the better part of the past decade doing OS design and implementation. It's just that they haven't been selling it; they have been using it internally.

Friday, July 10, 2009

How to Write an Equality Method in Java

Martin Odersky, Lex Spoon, and Bill Venners have written an article on How to Write an Equality Method in Java. Given the amount of traffic my post on the same topic has attracted, I thought I might take a look. Their solution is the best I have seen, and well worth a look, but the article suffers from the same problem as every article I've read on this topic: it fails to properly define equals().

A Java equals() method should implement a test for bisimulation.

To me, the remarkable thing about the Artima article is that it provides an example where the traditional getClass() approach I have previously recommended fails to implement bisimulation. The good news is that this is a rather artificial use-case, and as such isn't going to be causing many bugs. The bad news is that the fix requires a helper method defined at the topmost concrete class, and the need for it isn't intuitive.
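
For reference, the getClass()-based pattern I'm referring to looks like the following - a minimal sketch of my own, not code taken from the Artima article:

public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        // getClass() rather than instanceof: an instance of a subclass is
        // never considered equal to a plain Point.
        if (other == null || getClass() != other.getClass()) return false;
        Point that = (Point) other;
        return x == that.x && y == that.y;
    }

    @Override
    public int hashCode() {
        return 31 * x + y;
    }
}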

Anyway, if you program in Java you need to read and understand this article.

Thursday, July 02, 2009

Scala and Various Things.

Not quite ready to move across yet - however, I'm capturing some links to make life that little bit easier when I do. I'm hoping these approaches will work as well for Elmo/OTM as they do for Hibernate.

I'll be updating this with any more links I want to save for a sunny day.

A couple more links worthy of rereading:

Wednesday, June 17, 2009

Looking for JSON, ReST, (and in-memory RDF) frameworks

Currently writing a number of small web services to do various informatics tasks (more detailed post to come). Fortunately I'm not the one having to deal with 3rd-party SOAP APIs! Still, I do need to do various XML and JSON parsing, and not having addressed the latter before, I've gone looking for libraries.

I am currently about to start using Jackson, but was wondering if anyone has any warnings, advice, or recommended alternatives. In the course of looking at what was out there I have also come across Restlet, a ReST framework that seems well worth the time to figure out and deploy, so I will probably be doing that soon as well; any warnings or advice on it will be equally welcome.
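
For the record, the sort of Jackson usage I have in mind is plain data binding - a minimal sketch of my own against the 1.x (org.codehaus.jackson) packages, with a made-up result bean; the exact set of readValue() overloads varies between releases:

import java.io.StringReader;

import org.codehaus.jackson.map.ObjectMapper;

public class JacksonSketch {

    // Hypothetical result bean for one of the informatics services.
    public static class TaskResult {
        private String id;
        private double score;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public double getScore() { return score; }
        public void setScore(double score) { this.score = score; }
    }

    public static void main(String[] args) throws Exception {
        String json = "{\"id\":\"task-42\",\"score\":0.87}";
        ObjectMapper mapper = new ObjectMapper();
        // Bind the JSON object straight onto the bean via its setters.
        TaskResult result = mapper.readValue(new StringReader(json), TaskResult.class);
        System.out.println(result.getId() + " -> " + result.getScore());
    }
}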

One of the nice things about Restlet is its support for RDF. Granted it doesn't support querying, and the terminology in the interface is a bit confused, but it does use its native Resource interface for URIRefs, so it should integrate well. OTOH, if it does prove useful as a ReST framework, I can see myself writing a quick Sesame or Mulgara extension, as there is only so much you can do with RDF before you need a query and/or data-binding interface.

Thursday, June 04, 2009

Deploying to tomcat using maven

This is a note to capture the process I am currently using to build and deploy a basic web application to tomcat for development.

  1. Create simple web app using archetype.
    mvn archetype:create ... -DarchetypeArtifactId=maven-archetype-webapp
  2. Add server to ~/.m2/settings.xml
      <servers>
        <server>
          <id>local-tomcat</id>
          <username>username</username>
          <password>password</password>
        </server>
      </servers>
    
  3. Add server to toplevel pom.xml
      <build>
        ...
        <plugins>
          <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>tomcat-maven-plugin</artifactId>
            <version>1.0-beta-1</version>
            <configuration>
              <server>local-tomcat</server>
            </configuration>
          </plugin>
        </plugins>
      </build>
    
  4. compile, test, deploy, and subsequently redeploy
      $ mvn compile
      $ mvn test
      $ mvn tomcat:deploy
      $ mvn tomcat:redeploy
    

Of course what I would dearly love to know is how you configure a server URL in the settings.xml file. The documentation I can find (http://mojo.codehaus.org/tomcat-maven-plugin/configuration.html) describes how to use a custom URL, and how to configure authentication information. What it doesn't do is explain how you can do both at the same time, and all my attempts have resulted in XML validation errors when trying to run Maven --- if I figure it out I'll update this post.

Wednesday, May 27, 2009

ANTLR: an exercise in pain.

Ok, so for my current project I need to build either a heuristic or a machine-learning-based fuzzy parser. As someone who has written numerous standard parsers before, I find this interesting. Approaches I'm currently considering include cascading concrete grammars; stochastic context-free grammars; and various forms of hidden-Markov-model-based recognisers. Whichever approach ends up working best, the first stage for all of them is a scanner.

So I started building a JFlex lexer, and was making reasonable progress when I found out that we are already using ANTLR for other projects, so I should probably use it as well. Having experienced Mulgara's peak of five different parser generators - this does eventually become ridiculous - I was more than willing to use ANTLR. Yes it's LL, and yes I have previously discussed my preference for LR; but it does support attaching semantic actions to productions, so my primary requirement of a parser generator is met; and anyway, it has an excellent reputation and a substantial, active community.

What I am now stunned by is just how bad the documentation can be for such a popular tool. One almost non-existent wiki, a FAQ, and a woefully incomplete doxygen dump do not substitute for a reference manual. ANTLR has worse documentation than SableCC had when it consisted of an appendix to a master's thesis!

My conclusion: if you have any choice, don't use ANTLR. For Java: if you must use LL, I currently recommend JavaCC; if you can use an LALR parser, do so - my current preference is Beaver.

Monday, May 19, 2008

So why can't google provide html for 40% of pdfs?

A Google search for opensource filetype:pdf returns the standard 10 results on the front page, but only 6 of them offer a "View as HTML" link. Is it just me, or has this become more prevalent recently? And what is the common property that results in this behaviour?

If anyone has any clues or ideas I would love to hear them.

Tuesday, January 15, 2008

Now this is how conference videos should be presented

A great collection of conference presentations and interviews (mostly on Java topics) at Parleys - but of particular interest to me is this presentation, the best I have seen.

Thursday, November 29, 2007

How to start thinking about RDF.

While trying to understand RDF, its capabilities and its limitations, I have done a lot of reading of earlier work on semi-structured data management. It occurs to me that I haven't blogged a paragraph from a 1995 paper on heterogeneous information sources that really crystallised for me the crucial difference between RDF and previous attempts at semi-structured data (especially XML and friends) - better late than never I suppose...

We need not define in advance the structure of an object ... no notion of a fixed schema or object class. A label can play two roles: identifying an object (component) and identifying the meaning of an object (component). If an information source exports objects with a particular label, then we assume that the source can answer the question "What does this label mean?". It is particularly important to note that labels are relative to the source that exports them. That is, we do not expect labels to be drawn from an ontology shared by all information sources.

Papakonstantinou, Y. et al. "Object Exchange Across Heterogeneous Information Sources", 11th Conference on Data Engineering, IEEE Computer Society, 251-260, 1995.

The actual framework presented in the paper is focused on intraorganisational integration and so doesn't have the property of anarchic scalability required if it is to be applied to the internet - however it does express clearly this need for source-defined semantics if semi-structured data is to be tractable.

Wednesday, October 17, 2007

Implementing a Resource Manger for the Java Transaction API.

One would hope that when an organisation like Sun specifies a standard as important as a transaction API they might take the time to do a reasonable job of it -- unfortunately you would be wrong. The JTA specification is freely available from Sun, however the document is extremely vague and requires the reader to infer the underlying transaction state machine - with the unsurprising result that the corner cases are left completely unspecified. So having waded through this morass myself, I include here my advice on what you need to do to understand JTA, in the hope that anyone else coming to this standard won't have to waste as much time as I have understanding it.

Read the JTA spec last. JTA is little more than a thin wrapper around the Object Transaction Service 1.1 published by the OMG. This is one of the CORBA standards published in the late 90's and early 2000's - and like most of the early CORBA specs, it is well written; easy to read; and pretty complete. Unfortunately it too leaves the underlying state machine and some of the corner cases (especially those to do with recovery) underspecified. As a result I recommend printing this out and referring to it as an adjunct to the JTA spec.

Fortunately the CORBA OTS is itself an object-oriented wrapper around another spec, Distributed Transaction Processing: The XA Specification, published by X/Open (now The Open Group). There in Chapter 2 you will find most of the definitions missing from the other two specs; and in Chapter 6 the state tables that provide definitive semantics for the various operations you will need to implement. You will also find a reference to another related X/Open specification - Distributed Transaction Processing: Reference Model - which contains all the remaining definitions and assumptions missing from the JTA/OTS specs.

So if you do need to implement a JTA interface I strongly recommend you start with the X/Open reference model; then read the X/Open XA Spec; and only then read the JTA specification alongside the OTS spec for elaboration.
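
For orientation, the interface a JTA resource manager ultimately has to provide is javax.transaction.xa.XAResource. The skeleton below is only a map of the calls whose semantics those state tables pin down - a sketch, not a working resource manager:

import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;
import javax.transaction.xa.Xid;

public class SkeletonXAResource implements XAResource {
    public void start(Xid xid, int flags) throws XAException {
        // TMNOFLAGS / TMJOIN / TMRESUME: associate this resource with the transaction branch.
    }
    public void end(Xid xid, int flags) throws XAException {
        // TMSUCCESS / TMFAIL / TMSUSPEND: dissociate the branch from the current work.
    }
    public int prepare(Xid xid) throws XAException {
        // Phase one: vote. XA_OK to proceed, or XA_RDONLY if there is nothing to commit.
        return XA_OK;
    }
    public void commit(Xid xid, boolean onePhase) throws XAException {
        // Phase two, or the one-phase optimisation when onePhase is true.
    }
    public void rollback(Xid xid) throws XAException {
        // Undo the branch's work.
    }
    public void forget(Xid xid) throws XAException {
        // Discard knowledge of a heuristically completed branch.
    }
    public Xid[] recover(int flag) throws XAException {
        // TMSTARTRSCAN / TMENDRSCAN: report prepared branches after a crash.
        return new Xid[0];
    }
    public boolean isSameRM(XAResource other) throws XAException {
        return other == this;
    }
    public int getTransactionTimeout() throws XAException {
        return 0;
    }
    public boolean setTransactionTimeout(int seconds) throws XAException {
        return false;
    }
}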

Wednesday, September 19, 2007

Model-URI/URL Use-cases and Requirements and Proposal

Just posted this to mulgara-general - posting it here to provide a readily accessible permanent reference. I would greatly appreciate any comments anyone may have - please also feel free to solicit comments from outside the Mulgara community if there is interest.

The Use Cases and Requirements
The three key requirements of a model-URI proposal are:

1. Protocol/Scheme independence
2. Model/Server mobility
3. URI-standards compliance (ie. no fragment)

Also desirable are

4. Unique-name
5. Namespaced to allow a) potential resolution; b) predictable, human-readable URIs.

The context of the most complex use-case involves 4 models and 4 machines (and assumes a Distributed or Federated Resolver)

:modelA is on server1 on host1 and needs 
     to reference :modelB and :modelC
:modelB is on server2 on host2
:modelC is on server3 on host3
:modelD is on server4 on host4 run by an unrelated organisation

The application needs to perform the query:

select $id subquery(
  select $s $p $o 
  where $s $p $o in $locn and 
        $id <mulgara:locatedAt> $locn in <mulgara:modelURLResolver>)
from host1:modelA
where [ <:useModel> $id ] ;

This queries each model listed in :modelA after converting its identifier into a URL via a posited resolution mechanism.

Now host2 fails, and we restore server2 on host3 to run alongside server3.

We would like to be able to have the query run unmodified.

What this means is that :modelB cannot encode host2 in its URI.

The URI does need to encode some sort of server-id, as servers are guaranteed to use the same model-names at least some of the time (consider that all system models have the name "").

Also because :modelD and :modelA-C are managed by unrelated organisations we must somehow encode the organisation in the model's URI-stem as they may well decide to use the same server-id ("server1" or "database" anyone?).

Also consider that any encoding of the organisation must also allow that organisation to maintain their own independent registry, or the proposal ceases to be scale-free (it's on this that the original UUID proposal foundered).

I have considered abandoning requirement 4, and just using URLs. However ultimately we require a canonical name for internal purposes (even if it isn't exposed externally), and so even using URLs we would have to pick a designated 'unique name' for the model - we can't escape that - so we might as well save ourselves the headache and make it unambiguous.

So, a summary of my thinking on the use-cases/requirements for RDF model-names - we desire:

1. Unambiguously an identifier
2. Encodes organisation
3. Encodes server-id
4. Doesn't encode hostname
5. Potentially resolvable via a per-organisation registry

* Proposal

If we wish to be unambiguous then we should use our own URI-scheme. This has the added benefit that once we use our own scheme we have a lot more flexibility regarding how we structure the rest of the URI to meet our requirements.

I am proposing to use the scheme 'rdfdb' - as did the original UUID proposal.

I would prefer to avoid the use of opaque URIs; there is no reason why our URI can't be introspected if we structure it sanely - so the structure according to RFC 2396 will be 'rdfdb://authority/path'.

Logically the model-name itself makes a good path, so we arrive at 'rdfdb://authority/modelName', leaving us needing to encode an organisation and a server-id in the authority in a fashion that will potentially permit resolution via a registry.

Now as the authority is not a hostname, RFC 2396 identifies us as a "Registry-based Naming Authority". As such, the characters permitted to us are [ - _ . ! ~ * ' ( ) $ , ; : @ & = + ] (excluding the []'s) - and the characters reserved are [ ? / ].

I therefore propose to structure the authority 'server-id~organisation-id' (that is the server-id and org-id separated by a tilde).

At the moment we don't support hierarchical server-ids, but I would like to leave us the option of doing so once we start supporting more aggressive distribution. We also need to consider that the server-id needs to remain a valid path element for use in our existing model-URLs. So for now I would like to limit server-id to what we currently use, but ultimately I think we should consider some sort of delimited hierarchical form (probably dotted).

The organisation-id should be something that will eventually permit the identification of a registry. For now a dotted hierarchical form should suffice - although I will make sure the implementation leaves this as open as possible (the use of a tilde makes this possible).

It has also been suggested that, to make it unambiguously clear we are *not* encoding a hostname as the organisation-id, we should invert the traditional DNS-style representation.

So putting all the pieces together: if I am running a Mulgara server -

host:         pneuma.netymon.com
organisation: netymon.com
server-id:    rdfDatabase
model-name:   addressBook

The model URL for addressBook remains: rmi://pneuma.netymon.com/rdfDatabase#addressBook
or: soap://pneuma.netymon.com/rdfDatabase#addressBook ...etc...

and the model URI for the model is: rdfdb://rdfDatabase~com.netymon/addressBook
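
As a quick sanity check (my own illustration, not part of the proposal), java.net.URI is happy to treat the tilde-separated authority as a registry-based naming authority:

import java.net.URI;
import java.net.URISyntaxException;

public class RdfdbUriCheck {
    public static void main(String[] args) throws URISyntaxException {
        URI model = new URI("rdfdb://rdfDatabase~com.netymon/addressBook");
        System.out.println(model.getScheme());    // rdfdb
        System.out.println(model.getAuthority()); // rdfDatabase~com.netymon
        System.out.println(model.getHost());      // null - not a server-based authority
        System.out.println(model.getPath());      // /addressBook
    }
}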

Monday, September 17, 2007

Operational vs. Denotational Semantics

Spent a little time this afternoon discussing several topics with LB and SR. One topic we touched on was our continuing efforts to understand the distinction between denotational and operational semantics - I continue to be surprised at just how hard it's proving to nail down the precise distinction.

Looking at the various scrawls on my whiteboard that are the only physical remnants of what was a fascinating discussion, I gave some more thought to the distinction, and I believe it can be described thus:

Operational:  M(P) |= σ -> σ'
Denotational: M(P) |= κ -> P'

i.e. in operational semantics the meaning of a program is a transition function on a virtual machine;
in denotational semantics the meaning of a program is a mapping from an initial basis to a new (simplified) program.

Now this is confused by most operational semantics being "small-step", where the meaning is defined via structural recursion on the abstract grammar:

M(ρ) |- σ -> M(ρ') σ'   { the meaning of an AST production is a transition function from an initial VM state to the meaning of a (simplified) production applied to a new VM state }

This ends up looking very similar to denotational semantics, as denotational semantics are normally defined via structural recursion.

But even given the similarity, the core distinction (as I understand it) remains: denotational semantics are defined in terms of a reduction semantics from program+basis to a simplified program (normally a program in a different, mathematically tractable language such as the lambda calculus), whereas operational semantics are defined in terms of a transition semantics from program+initial-state to a new state (normally a state extended with function values defined in a mathematically tractable language such as the lambda calculus).
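
To make the operational reading concrete, here is a toy illustration of my own (not something from the whiteboard): in a language consisting of single 'x := x + k' instructions, the meaning of an instruction is literally a function from one machine state σ to the next σ'.

import java.util.HashMap;
import java.util.Map;

public class ToyOperationalSemantics {

    // M(P) : State -> State, where a state is just a variable store.
    interface Transition {
        Map<String, Integer> step(Map<String, Integer> sigma);
    }

    // The meaning of the instruction "var := var + k" as a transition function.
    static Transition increment(final String var, final int k) {
        return new Transition() {
            public Map<String, Integer> step(Map<String, Integer> sigma) {
                Map<String, Integer> next = new HashMap<String, Integer>(sigma);
                Integer current = sigma.get(var);
                next.put(var, (current == null ? 0 : current) + k);
                return next; // sigma' differs from sigma only at var
            }
        };
    }

    public static void main(String[] args) {
        Transition meaning = increment("x", 3);
        Map<String, Integer> sigma = new HashMap<String, Integer>();
        System.out.println(meaning.step(sigma)); // {x=3}
    }
}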

Now of course interpreter semantics is a whole different kettle of fish, and of course leaves you facing the 'turtle problem' - you can try for 'turtles all the way down' if you like, but once you hit the metal you've pretty much lost the point of having a semantics in the first place. I must admit to admiring the way ECMAScript handled it - an interpreter semantics written in SML, which has an operational semantics of its own, avoiding the problem.

Thursday, August 30, 2007

RDF, Document Repositories

I have recently had several discussions with people working on or with various document repository projects. One thing I hear is an increasing understanding of the importance of semantics in repository metadata management. I'm naturally pleased to see this, because storage, query, and management of semantics is precisely where projects like Mulgara come in.

Now digital repositories are a lot more than just a semantic store: there are also issues associated with the actual document storage and retrieval, and with the QA, workflow, and various metadata-extraction tasks. However, these days most repository projects include some sort of semantic store to manage metadata, and the question is reasonably asked - why not just use that?

The reason I recommend considering Mulgara to augment a document repository is the additional flexibility gained by doing so. When a repository developer approaches the metadata problem they have a natural tendency to adopt an 'instance-focused' view of metadata. This is where the focus is on the 'document', where each document is of a given 'document type', and each 'document type' implies corresponding 'attributes'. In contrast, RDF is 'property-focused' --- the focus is on defining the 'meaning' of 'properties', and a 'document' having a given property 'classifies' the document as belonging to a given 'class of documents'.

While the two are duals, they do influence design. If you take an instance-focused approach you will find yourself heading towards document-type hierarchies and defined attribute lists. If you take a property-focused approach you will find yourself defining lexicons and ontologies. The former tends towards relational approaches, the latter towards semantic approaches such as RDF. The reason I believe the semantic approach to be superior is its flexibility. Hierarchies are predicated on centralised control. Even if you can maintain central control over the document-type definitions and their attributes, the very act of standardisation in this manner leaves your project with the unfortunate choice between 'scale' and 'responsiveness to change'.

RDF and its related standards allow the decentralisation of attribute/property definition, while providing the tools to manage the resulting 'mess'. With the combination of namespaces, RDFS, and assuming the provision of schema repositories, it becomes possible to allow global use and reuse of locally defined 'models'. This is especially relevant when you consider that 'local' should be considered in both temporal and spatial senses - keeping the system responsive both to the differing needs of independent organisations, and the changing requirements of a single entity.

The result is the need for two additional boxes in your architecture diagram. The first is a vocabulary server that allows users to define their own vocabulary extensions, makes those definitions available to applications, and - critically - keeps them amenable to management and eventual rationalisation. The second is a metadata server that can store the resulting data and support both ad-hoc querying by applications and the inferencing required by vocabulary rationalisation. One of the reasons Mulgara exists is to provide the storage and query engine required by these components - and so I do enjoy having the chance to talk to people about it.

Wednesday, February 14, 2007

The Effect of File Sharing on Record Sales: An Empirical Analysis

Felix Oberholzer-Gee (Harvard University) and Koleman Strumpf (University of Kansas)

Abstract

For industries ranging from software to pharmaceuticals and entertainment, there is an intense debate about the appropriate level of protection for intellectual property. The Internet provides a natural crucible to assess the implications of reduced protection because it drastically lowers the cost of copying information. In this paper, we analyze whether file sharing has reduced the legal sales of music. While this question is receiving considerable attention in academia, industry, and Congress, we are the first to study the phenomenon employing data on actual downloads of music files. We match an extensive sample of downloads to U.S. sales data for a large number of albums. To establish causality, we instrument for downloads using data on international school holidays. Downloads have an effect on sales that is statistically indistinguishable from zero. Our estimates are inconsistent with claims that file sharing is the primary reason for the decline in music sales during our study period.

In the Journal of Political Economy.

Available in Volume 115, Number 1, February 2007

From Ars Technica via Lawrence Lessig

Tuesday, February 13, 2007

Java generics and the covariance and contravariance of arguments

Well, given we require 1.5 now for other reasons, and 1.5 does complain if you don't constrain generic classes, I have finally bitten the bullet and started using generics. Unfortunately I just got bitten by what I suspect is going to be a very common mistake - in this case, failing to properly consider the type equivalence of parametrised method calls.

Consider the following code:

public interface TestInterface { }

public class TestClass implements TestInterface { }

import java.util.ArrayList;
import java.util.List;

public class Test {
 private List<TestClass> list;

 public TestInterface test() {
   list = new ArrayList<TestClass>();
   list.add(new TestClass());

   return covariant(list);
 }

 public TestInterface covariant(List<TestInterface> ilist) {
   return ilist.remove(0);
 }
}
Now there is absolutely no reason why this should not work. It is trivially inferable that the above code treats ilist as covariant in the list-type - and that therefore this code is statically correct.

Of course Java's typing has never been particularly smart. List<T1>.add(T1) is contravariant in T1, and T2 List<T2>.get(int) is covariant in T2; so the Java compiler is correct to infer that in the general case List<T1> and List<T2> are substitutable iff T1 == T2.

If we can't declare a generic parameter to be covariant in its type parameter we have a serious problem - it means that any non-trivial algorithm involving collections is going to run afoul of this. You might consider trying to cast your way around it:

 public TestInterface test() {
   list = new ArrayList<TestClass>();
   list.add(new TestClass());

   return covariant((List<TestInterface>)list);
 }
but not surprisingly that didn't work.
Test.java:11: inconvertible types
found   : java.util.List<TestClass>
required: java.util.List<TestInterface>
 return covariant((List<TestInterface>)list);
                                       ^
1 error
If you continue to hack at it you might try a double cast via a non-generic List.
 public TestInterface test() {
   list = new ArrayList<TestClass>();
   list.add(new TestClass());

   return covariant((List<TestInterface>)((List)list));
 }
This works but leaves us with the unchecked/unsafe operation warning:
Note: Test.java uses unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Now this is a perfectly reasonable warning - it is unchecked; it is unsafe; and more importantly it does violate encapsulation. The problem here is that the caller should not be defining the type invariants of the callee - that's the job of the method signature!

The correct solution is to allow us to declare covariant() to be covariant in its argument; and fortunately Java does support this.

To declare an argument to be covariant in its type parameter you can use the extends keyword:

 public TestInterface covariant(List<? extends TestInterface> ilist) {
   return ilist.remove(0);
 }
To declare an argument to be contravariant in its type parameter you use the super keyword:
 public void contravariant(List<? super TestClass> clist, TestClass c) {
   clist.add(c);
 }
Without these two facilities generics would be badly broken, so I am glad Sun had the presence of mind to include them - by the way, if you are using Java 1.5 I strongly recommend you read the Java Generics Tutorial.
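
The two wildcards together are exactly what the JDK itself uses: java.util.Collections.copy declares List<? super T> dest and List<? extends T> src. A simplified version (my own sketch, skipping the size checks the real method performs) shows why each wildcard points the way it does:

import java.util.List;

public class CopyExample {
    // Read from the producer (extends), write into the consumer (super).
    public static <T> void copy(List<? super T> dest, List<? extends T> src) {
        for (int i = 0; i < src.size(); i++) {
            dest.set(i, src.get(i)); // get() yields a T, set() accepts a T
        }
    }
}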

As an aside, it is worth noting that as Java includes a top type, Object, List<? extends Object> is a common covariant type - sufficiently common that Sun has included a syntactic sugar for it: List<?>. Personally I'm not sure this was such a good idea; List<? extends Object> would work anyway, and I think I would prefer to have kept the covariance explicit.

Update: Corrected capitalisation error in initial Java example.

Tuesday, February 06, 2007

Five things to consider when writing a Mulgara Resolver

Ended up writing a longer response than I had planned to a query about writing a resolver in Mulgara today. I'm putting it here to keep a handle on it, as it does cover the basic structure of the resolve() method in reasonable detail.

First it is important to realise that resolvers *don't* return triples - they return Resolutions. These are Tuples that provide bindings for the variables in the Constraint passed to resolve(). So in the case of <http://www.example.com/path/subpath> $predicate $object the resulting Resolution should have two variables ($predicate $object). In the case of <../subpath> <http://www.schema.com#parent> $subpath it will have one ($subpath).

You should also be aware that a Resolution can be unevaluated! It is not uncommon for bindings required to evaluate the constraint to come from other parts of the query. Consider the following where clause:

$url $p $o in <rmi://localhost/server1#sample> 
and 
<myfile> <hasurl> $url
in this case your resolver will be asked to resolve ($url $p $o) and should return a Resolution that will later be passed the $url binding in the prefix argument to beforeFirst(). Evaluation would then occur either in beforeFirst() or in the calls to next() - we prefer it to happen in beforeFirst() if the memory requirement isn't unreasonable, as our algorithmic reasoning assumes a comparatively cheap next().

If you require that a particular variable be bound prior to final evaluation then you need to provide a MandatoryBindingAnnotation - this indicates to the join logic that it must ensure a specific binding is satisfied by other constraints in the query before you are evaluated (in this case $url).

It is also worth noting that due to the support of intervals and the resulting interaction with query transformations, the XSDResolver is quite complicated as resolvers go. Without that a call to resolve consists of:

  1. Obtaining the model (constraint.getModel()).
  2. Do any preparatory work, especially any work that might be able to prove the result Empty (or a singleton).
  3. If you can't prove the result empty (or singleton), defer further evaluation to the returned Resolution.
Then inside the Resolution you need to consider how you implement the following key methods and annotations:
MandatoryBindingAnnotation
are there any variables that *must* be bound for the deferred evaluation to terminate?
DefinablePrefixAnnotation
can you cheaply reorder the variables in the result (log n or less)?
ReresolvableResolution
can you cheaply re-resolve the constraint if additional information becomes available (again log n or less)? [note: this will become an Annotation like the other two in the Mulgara 1.2 dev-cycle]
beforeFirst()
you can ignore the suffixTruncation argument, but you can't ignore the prefix - these *are* the values of the first N variables of the resolution. If all the variables are passed as a prefix your only decision is 1 row or 0, but most of the time you will be passed fewer than that.
At this point you have either performed the evaluation, or you have set up the evaluation and deferred the rest to be done incrementally on each call to next().
next()
does whatever is required to ensure that subsequent calls to getColumnValue() can return values for the current row.
There is only one Tuples class that defers evaluation beyond this point (the implementation of count()). Naturally we don't want to go to the effort of evaluating an entire subquery until the user actually goes to use it - so we defer evaluation of the count() until the call to getColumnValue().
getColumnValue()
normally this is a matter of returning values calculated in either beforeFirst() or next() - occasionally it amounts to performing the evaluation itself, but this is uncommon.

The whole point of the Resolution/Tuples/beforeFirst/next bother is to implement lazy evaluation in Java. We only scale to bignum levels when all query evaluation is done on a call-by-need basis.