Monday, March 11, 2013

Configuring Sesame 2.6.10 data-directory without using System properties

Thanks to Peter Ansell who helped me track this undocumented configuration option in Sesame.

In the openrdf-http-server-servlet.xml file the adunaAppConfig bean can accept a dataDirName property with the value of the data-directory.

I haven't found this documented anywhere, so I hope this helps someone. Probably that person will be me next time I need to configure Sesame.

The fragment in question:

    <bean class=""
          destroy-method="destroy" id="adunaAppConfig">
        <property name="applicationId" value="OpenRDF Sesame"/>
        <property name="longName" value="OpenRDF Sesame"/>
        <property name="version" ref="adunaAppVersion"/>
        <property name="dataDirName" value="/var/lib/aduna"/>
    </bean>

Wednesday, November 25, 2009

JRuby and JNI and OS X.

Like practically everyone I sat down a few years back and picked up enough Ruby to write an address-book app in Rails. Nice enough language, a few nice features[0]. Still at the end of the day... YADTL, so ultimately YAWN[1].

All this is to say that I found myself yesterday facing the need to convert a trivial rails app into a java servlet, with the added complication that it uses a third-party C++ library to do the bulk of the work.

JRuby is a trivial install, and the application in question didn't make use of continuations, so rip out all the Rails code, and replace with a simple loop that takes requests from the cmdline instead of a POST parameter, and we're in business.

Well up to the point where it falls over because it needs the C++ library. Here's the first real hurdle. Apple's JDK for OS X is compiled as a 64-bit binary which can't link against 32-bit libraries[2], and by default the C compiler only produces 32-bit libraries.

$ file /usr/local/lib/libcrfpp.dylib
/usr/local/lib/libcrfpp.dylib: Mach-O dynamically linked shared library i386

A quick check on Google yields pages suggesting that I need to add -arch i386 -arch x86_64 to the CXXFLAGS and LDFLAGS. So after modifying the Makefile to compile the library with the above flags and install it in /opt/local we have:

$ file /opt/local/lib/libcrfpp.dylib
/opt/local/lib/libcrfpp.dylib: Mach-O universal binary with 2 architectures
/opt/local/lib/libcrfpp.dylib (for architecture i386): Mach-O dynamically linked shared library i386
/opt/local/lib/libcrfpp.dylib (for architecture x86_64): Mach-O 64-bit dynamically linked shared library x86_64

Then we need to compile the SWIG generated JNI wrapper, which requires the above parameters and also needs to link against Apple's JNI installation. Back to Google, which provides a page from Apple describing how to compile JNI on Darwin; it gives the include path and an additional link flag.

The command lines to compile the wrapper in question become:

g++ -arch i386 -arch x86_64 -c -I/System/Library/Frameworks/JavaVM.framework/Headers CRFPP_wrap.cxx
g++ -arch i386 -arch x86_64 -dynamiclib -L/opt/local/lib -lcrfpp -o libCRFPP.jnilib CRFPP_wrap.o -framework JavaVM

Wanting to avoid calling System.loadLibrary from ruby code I wrote a quick wrapper class that could make the call in a static block:

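package crfppwrapper;

import org.chasen.crfpp.Tagger;

public class CRFPP {
    // Load the JNI library (libCRFPP.jnilib) once, when the class is first used,
    // so the Ruby code never has to call System.loadLibrary itself.
    static {
        System.loadLibrary("CRFPP");
    }

    public static Tagger newTagger(String args) {
        return new Tagger(args);
    }
}

Then modify the relevant ruby file to link to the jar file:

require 'java'
require 'lib/CRFPP-0.53.jar'

and call the wrapper: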
Finally modify the toplevel jruby script to pass the necessary java parameters to the jvm:

#!/usr/bin/env jruby -J-Djava.library.path=/opt/local/lib -J-cp .:lib/CRFPP-0.53.jar -w

And it worked. No more Rails; migrated to jruby; and ready to bundle into a trivial servlet that will reintroduce the POST parameter to the mix.
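If the load had failed, the symptom would have been an UnsatisfiedLinkError from the static block. The sketch below is not from the original setup: NativeCheck and tryLoad are hypothetical names. It shows one way to surface the two things worth checking first, the library path and the JVM architecture.

```java
// Hypothetical diagnostic helper; the class and method names are illustrative only.
class NativeCheck {
    /** Attempts to load a native library by name; returns true on success. */
    static boolean tryLoad(String libName) {
        try {
            System.loadLibrary(libName);
            return true;
        } catch (UnsatisfiedLinkError e) {
            // Thrown when the library is not on java.library.path or cannot
            // be linked (for example, a 32-bit library under a 64-bit JVM).
            System.err.println("Could not load '" + libName + "': " + e.getMessage());
            System.err.println("java.library.path = " + System.getProperty("java.library.path"));
            System.err.println("os.arch           = " + System.getProperty("os.arch"));
            return false;
        }
    }
}
```

Calling NativeCheck.tryLoad("CRFPP") before touching the wrapper class would report the search path instead of failing inside a static initialiser.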

[0] The implicit closure parameter to every function and an elegant brace syntax for creating the closures themselves does lend itself to some wonderfully elegant idioms.

[1] Of course if you were coming from J2EE you were probably more than happy to exchange static-typing for configuration-by-convention.

[2] And all of a sudden I'm having flashbacks to horror of IRIX binary toolchain management in the late 90's.

Tuesday, September 08, 2009

ldd equivalent under Darwin (OS X)

ldd is an incredibly useful command under Linux. It lists the shared-libraries required by an executable, and exactly which .so file each dependency is currently resolved to.

A similarly useful tool is objdump, which permits inspection of object files, especially executables and shared-libraries, extracting things like the string-table, symbol-table, and various headers and sections.

Under Darwin these tools are combined in otool. ldd can be duplicated by otool -L.

Wednesday, August 26, 2009

Gov2.0 and Open-Source - a response

Cross-posted from and lessons from the open-source world at the Gov2.0 Taskforce blog.

My biggest worry is that the government's response to this initiative will be the announcement of some $multi-million grand-gesture. Big press-conference; Minister announcing the 'grand vision'; and the possible benefits we could see, lost in the maze that is large-scale government procurement.

The key insight of CatB is the extent to which redundancy is a benefit in exploratory development.

For the moment, we have no idea of the correct model for Gov2.0 - we have some understanding of what has worked outside of government, and a few promising avenues of approach, but no actual answers.

So I think we want to recommend that different Agencies experiment with different approaches and that the OIC be tasked with:

  1. Examining the success/failure of the different attempts, and eventually starting to help agencies improve the success rate.
  2. Ensuring that the legal and regulatory requirements for aggregation and interoperability of/between these different services are standardised, as these are the issues that will derail bazaar development.
  3. Acting as a central clearing house where agencies/organisations/individuals can choose to self-publish interface descriptions, custom schema, metadata element schemes, vocabularies etc.
  4. Providing a collaboration and mediation service to allow the reconciliation of conflicting interfaces/schemas/schemes/vocabs.

The result would hopefully be a myriad of exploratory projects, some of which would fail, most of which would be ho-hum, but many of which would succeed.

The OIC would act as an institutional memory, learning and recording the lessons learnt; an institutional coordinator, making sure that people wanting to aggregate/integrate the different data-sources aren't forbidden from doing so; and an institutional mediator, assisting the different projects in finding and working together when they would like to.

Please post any comments to the gov2.0 site listed above, not here.

Monday, July 13, 2009

Google’s new OS could hit Microsoft where it hurts

In a blog post, Andy Goldberg provides a quote that I suspect a lot of Microsoft sycophants will be repeating to themselves over the next year:

"Google may or may not have the experience and capability of actually producing an operating system and getting it deployed," he said. "It may not realise how hard it is."

Anyone who takes this line should remind themselves: Chrome OS is not an operating system! It is a browser-based windowing system running on top of an open-source OS, and Google can definitely handle that. Moreover, Google has spent the better part of the past decade doing OS design and implementation. It's just that they haven't been selling it; they have been using it internally.

Friday, July 10, 2009

How to Write an Equality Method in Java

Martin Odersky, Lex Spoon, and Bill Venners have written an article on How to Write an Equality Method in Java. Given the amount of traffic my post on the same topic has attracted, I thought I might take a look. Their solution is the best I have seen, and well worth a look, but the article suffers from the same problem as every article I've read on this topic: it fails to properly define equals().

A Java equals() method should implement a test for bisimulation.

To me, the remarkable thing about the Artima article is that it provides an example where the traditional getClass() approach I have previously recommended fails to implement bisimulation. The good news is that this is a rather artificial use-case, and as such isn't going to be causing many bugs. The bad news is that the fix requires a helper method defined at the topmost concrete class, and the need for it isn't intuitive.
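For anyone who wants the shape of the fix before reading the article, here is a minimal sketch of the helper-method idea. The Point/ColoredPoint classes and the canEqual name follow my recollection of the article and should be treated as assumptions, not as a copy of its code.

```java
// Illustrative classes; names and fields are assumptions for this sketch.
class Point {
    private final int x;
    private final int y;

    Point(int x, int y) { this.x = x; this.y = y; }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof Point)) return false;
        Point that = (Point) other;
        // Ask the other object whether it is willing to be equal to us.
        return that.canEqual(this) && x == that.x && y == that.y;
    }

    @Override
    public int hashCode() { return 41 * (41 + x) + y; }

    /** Helper defined at the topmost class; subclasses that add state override it. */
    public boolean canEqual(Object other) { return other instanceof Point; }
}

class ColoredPoint extends Point {
    private final String color;

    ColoredPoint(int x, int y, String color) { super(x, y); this.color = color; }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof ColoredPoint)) return false;
        ColoredPoint that = (ColoredPoint) other;
        return that.canEqual(this) && color.equals(that.color) && super.equals(that);
    }

    @Override
    public int hashCode() { return 41 * super.hashCode() + color.hashCode(); }

    @Override
    public boolean canEqual(Object other) { return other instanceof ColoredPoint; }
}
```

With this in place a Point never equals a ColoredPoint in either direction, while an anonymous subclass such as new Point(1, 2) {} still equals a plain Point(1, 2), which is the case a getClass() comparison rejects.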

Anyway, if you program in Java you need to read and understand this article.

Thursday, July 02, 2009

Scala and Various Things.

Not quite ready to move across yet - however capturing some links to make life that little bit easier when I do. I'm hoping these approaches will work as well for Elmo/OTM as they do for Hibernate.

I'll be updating this with any more links I want to save for a sunny day.

A couple more links worthy of rereading:

Wednesday, June 17, 2009

Looking for JSON, ReST, (and in-memory RDF) frameworks

Currently writing a number of small web-services to do various informatics tasks (more detailed post to come). Fortunately I'm not the one having to deal with 3rd-party SOAP apis! Still I do need to do various XML and JSON parsing, and not having addressed the latter before I've gone looking for libraries.

Currently I am about to start using Jackson, but was wondering if anyone had any warnings, advice, or recommended alternatives? In the course of looking at what was out there I have also come across Restlet, a ReST framework that seems like it is well worth the time to figure out and deploy, so I will probably be doing that soon as well, any warnings or advice on this will be welcome.

One of the nice things about Restlet is its support for RDF. Granted it doesn't support querying, and the terminology in the interface is a bit confused, but it does use its native Resource interface for URIRefs, so it should integrate well. OTOH, if it does prove useful as a ReST framework, I can see myself writing a quick Sesame or Mulgara extension, as there is only so much you can do with RDF before you need a query and/or data-binding interface.

Thursday, June 04, 2009

Deploying to tomcat using maven

This is a note to capture the process I am currently using to build and deploy a basic web application to tomcat for development.

  1. Create simple web app using archetype.
    mvn archetype:create ... -DarchetypeArtifactId=maven-archetype-webapp
  2. Add server to ~/.m2/settings.xml
  3. Add server to toplevel pom.xml
  4. compile, test, deploy, and subsequently redeploy
      $ mvn compile
      $ mvn test
      $ mvn tomcat:deploy
      $ mvn tomcat:redeploy

Of course what I would dearly love to know is how you configure a server url in the settings.xml file. The documentation I can find describes how to use a custom url, and how to configure authentication information. What it doesn't do is explain how you can do both at the same time, and all my attempts have resulted in XML validation errors when trying to run maven. If I figure it out I'll update this post.

Wednesday, May 27, 2009

ANTLR: an exercise in pain.

Ok, so for my current project I need to build either a heuristic or a machine-learning-based fuzzy parser. As someone who has written numerous standard parsers before, this qualifies as interesting. Current approaches I'm considering include cascading concrete grammars; stochastic context-free grammars; and various forms of hidden-markov-model based recognisers. Whatever approach ends up working best, the first stage for all is a scanner.

So I start building a jflex lexer, and make reasonable progress when I find out that we are already using ANTLR for other projects, so I should probably use it as well. Having experienced mulgara's peak of 5 different parser generators - this does eventually become ridiculous - I was more than willing to use ANTLR. Yes it's LL, and yes I have previously discussed my preference for LR; but, it does support attaching semantic actions to productions, so my primary requirement of a parser-generator is met; and anyway, it has an excellent reputation, and a substantial active community.

What I am now stunned by is just how bad the documentation can be for such a popular tool. One almost non-existent wiki, a FAQ, and a woefully incomplete doxygen dump do not substitute for a reference. ANTLR has worse documentation than sablecc had when it consisted of an appendix to a masters thesis!

My conclusion: If you have any choice don't use ANTLR. For Java: if you must use LL I currently recommend JavaCC; if you can use an LALR parser do so, my current preference is Beaver.

Monday, May 19, 2008

So why can't google provide html for 40% of pdfs?

A Google search for opensource filetype:pdf returns the standard 10 results on the front page, but only 6 of them offer a "View as HTML" link. Is it just me, or has this become more prevalent recently? And what is the common property that results in this behaviour?

If anyone has any clues or ideas I would love to hear them.

Tuesday, January 15, 2008

Now this is how conference videos should be presented

Parleys is a great collection of conference presentations and interviews (mostly on Java topics), but of particular interest to me is the presentation, the best I have seen.

Thursday, November 29, 2007

How to start thinking about RDF.

While trying to understand RDF, its capabilities and its limitations, I have done a lot of reading of earlier work on semi-structured data management. It occurs to me that I haven't blogged a paragraph from a 1995 paper on heterogeneous information sources that really crystallised for me the crucial difference between RDF and previous attempts at semi-structured data (especially XML and friends). Better late than never I suppose...

We need not define in advance the structure of an object ... no notion of a fixed schema or object class. A label can play two roles: identifying an object (component) and identifying the meaning of an object (component). If an information source exports objects with a particular label, then we assume that the source can answer the question "What does this label mean?". It is particularly important to note that labels are relative to the source that exports them. That is, we do not expect labels to be drawn from an ontology shared by all information sources.

Papakonstantinou, Y. et al. "Object Exchange Across Heterogeneous Information Sources", 11th Conference on Data Engineering, IEEE Computer Society, 251-260, 1995.

The actual framework presented in the paper is focused on intraorganisational integration and so doesn't have the property of anarchic scalability required if it is to be applied to the internet - however it does express clearly this need for source-defined semantics if semi-structured data is to be tractable.