open scriptures

Open Scriptures sandbox API server

For the past few months I've been hosting a "sandbox" server for the Open Scriptures API. The purpose is to provide a proof-of-concept for the code, and to ensure that we know how to make the code work in production. It has been an adventure, since the code and its requirements have changed many times.

Tonight I reached an important milestone with the API server: I am now serving the API with the full Tischendorf Greek New Testament text, using MySQL as the database. Previously I had been hosting only the book of Jude, and using only SQLite for the database. So now I have a fully functioning instance, with a full dataset, in a real production environment.

As an example, you can visit here: http://api.ossandbox.info/texts/passage/Bible.Tischendorf:2Thess.1.1-2Thess.1.3 (that's 2 Thessalonians 1:1-3 in an XML format).
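The URL above appears to combine a work identifier with a start and end verse reference. A minimal sketch of assembling such a URL follows; the helper name passage_url and the assumption that other passages follow the same pattern are mine, not something the API documentation confirms.

```python
BASE_URL = "http://api.ossandbox.info/texts/passage/"

def passage_url(work, book, start, end):
    """Hypothetical helper: build a passage URL shaped like the example above.

    start and end are (chapter, verse) pairs within the same book.
    """
    first = "{0}.{1}.{2}".format(book, start[0], start[1])
    last = "{0}.{1}.{2}".format(book, end[0], end[1])
    return "{0}{1}:{2}-{3}".format(BASE_URL, work, first, last)

# Rebuild the 2 Thessalonians 1:1-3 example from the text.
url = passage_url("Bible.Tischendorf", "2Thess", (1, 1), (1, 3))
print(url)
```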

If you are interested in the project and would like to know more, check out our mailing list, and visit us on IRC in #openscriptures on irc.freenode.net.

(For the technically inclined, Open Scriptures is a group of Django apps, bundled in Pinax. It is being served by Apache and mod_wsgi. Thanks especially to Patrick Altman and Brian Rosner for helping me get things up and running.)

Openscriptures Meet Up: The Prequel

In advance of the planned first-ever Open Scriptures Meetup (OSMU) in September (to coincide with Djangocon), a few of us will be meeting at OSCON to take in the State of the Onion Address and then get some food. We don't want to pre-empt the primacy of the actual first-ever OSMU, so we're going to call this one the Pre-OSMU OSMU. It's a test run.

The Oregon Convention Center

Cryptographic hashes and RESTful URIs

In a recent post to the Open Scriptures mailing list, it was suggested that we use MD5 (or another cryptographic hash) to generate unique IDs for each token. (A "token" is the fundamental unit of text, most often a word, in our API database models.) Today we discussed the implementation of this on IRC, and it was fairly stimulating.

First of all, MD5 is considered broken and is deprecated for new designs, because collisions (two different pieces of data that produce the same hash) can be constructed deliberately. Since we will be dealing with millions of tokens, we decided not to test our luck, unlikely though a problem may be. SHA-256 has no known collisions, so we decided it was best to use that algorithm.
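To put "unlikely" in perspective, the birthday bound gives a rough estimate of the chance that any two of n inputs collide by accident under an ideal b-bit hash: p is approximately n(n-1) / 2^(b+1). The sketch below applies it to both digest sizes. The figure of ten million tokens is an assumption chosen for illustration, not a number from the project, and the estimate covers accidental collisions only, not deliberately constructed ones.

```python
from fractions import Fraction

def collision_probability(n_items, hash_bits):
    """Approximate chance that any two of n_items collide under an ideal hash.

    Uses the birthday-bound approximation p ~= n(n-1) / 2^(bits+1),
    which is accurate when p is small.
    """
    pairs = Fraction(n_items * (n_items - 1), 2)
    return float(pairs / 2 ** hash_bits)

# Hypothetical corpus size, chosen only for illustration.
N_TOKENS = 10000000

p_md5 = collision_probability(N_TOKENS, 128)     # MD5 digests are 128 bits
p_sha256 = collision_probability(N_TOKENS, 256)  # SHA-256 digests are 256 bits
print("md5:     %.3g" % p_md5)
print("sha-256: %.3g" % p_sha256)
```

Both numbers are vanishingly small, so the practical argument for SHA-256 rests on MD5's known attacks rather than on accidental collisions.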

SHA-256 is implemented in hashlib, part of Python's standard library, so that is good. For example:

>>> import hashlib
>>> hashlib.sha256("Hello world!").digest()
'\xc0S^K\xe2\xb7\x9f\xfd\x93)\x13\x05Ck\xf8\x891NJ?\xae\xc0^\xcf\xfc\xbb}\xf3\x1a\xd9\xe5\x1a'

Needless to say, such a digest would not be very good for use in a RESTful URI scheme. So, hashlib also offers a hexadecimal option:

>>> hashlib.sha256("Hello world!").hexdigest()
'c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a'

That is still not the best, since that makes for a very long string. So, we have the option of using base64 encoding:

>>> import base64
>>> base64.b64encode(hashlib.sha256("Hello world!").digest())
'wFNeS+K3n/2TKRMFQ2v4iTFOSj+uwF7P/Lt98xrZ5Ro='

That is shorter, but it includes the "/" character, which is a no-no for URI design. Luckily, the base64 module includes a function for this exact purpose.
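The function in question is presumably base64.urlsafe_b64encode, which substitutes "-" and "_" for "+" and "/". A minimal sketch follows, written in Python 3 syntax (where the input must be bytes, unlike the Python 2 sessions above). The helper name url_token_id and the decision to strip the trailing "=" padding are assumptions for illustration, not the project's actual scheme.

```python
import base64
import hashlib

def url_token_id(text):
    """Hypothetical helper: URL-safe base64 of the SHA-256 digest of text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # urlsafe_b64encode uses "-" and "_" in place of "+" and "/".
    encoded = base64.urlsafe_b64encode(digest).decode("ascii")
    # A 32-byte digest encodes to 43 characters plus one "=" of padding;
    # dropping the padding is an optional choice made here for shorter URIs.
    return encoded.rstrip("=")

token_id = url_token_id("Hello world!")
print(token_id)
```

If the padding is stripped, it has to be restored (one "=" for a 32-byte digest) before the ID can be decoded back into the raw digest.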

Sunday Roundup

A few things of note:

  • A bunch of folks from the Open Scriptures project are hanging out in irc: #openscriptures on irc.freenode.net
  • The MorphGNT site is active and rumbling again.
  • James Tauber and Patrick Altman's οχλος is a tool for collaborative corpus linguistics. The demo task provides an interface to enter morphological parsings on the gospel of John, and Tauber is even working on a cooler interface. I had wanted to launch something like this, but smarter people are taking care of it. It is official: this will be the coolest site on the web when it is done.
  • Speaking of active and rumbling, Kim and I visited Mt. St. Helens today.
  • OSCON is this week in Portland!

First Open Scriptures Hackfest

Over the long weekend I got together with Weston, and we did some work on Open Scriptures. This resulted in a storm of commits, and a much better code base. We have been fortunate enough to be joined by Patrick Altman, whose experience with Django led to some immediate improvements in the structure and functionality of the code. This in turn has inspired other work, and the project as a whole is moving at a decent pace at the moment.

We are hanging out in #openscriptures on irc.freenode.net. If you are inclined to drop in and learn more about the project, feel free to join the room.

What's better than a public web API

I am really excited for the Open Scriptures API project. It will create an open, efficient, federated system for querying scripture and metadata. And using that web API many useful "front-end" apps can be developed. But I like something better than the public web API: the internal Django API for the project. Sure, you can make use of the RESTful URL structure to make query writing very easy, but I would prefer to use the internal Python API for any apps I would write, which is just as easy to use in my opinion (though perhaps not as easy to learn if you are not familiar with Python and Django). Plus there is less abstraction involved, since it's all Python. So, if you someday want to write an application which makes use of the Open Scriptures API, don't forget that you can write a Django app with direct access to the internal API, and that will make for some very efficient and easy coding. I am curious to see if more applications end up making use of the public HTTP API or the internal one.

Open Scriptures Meet Up?

I suggested that we have an "OpenScripturesCon" in Portland some time. Then Weston suggested that we do it to coincide with Django-con in September, and host it at Multnomah University. So, the OSMU (Open Scriptures Meet Up at Multnomah U) is (possibly) born. Django-con's dates are not nailed down yet, so there is no firm date for the event. It would be quite beneficial to meet with some expert Django developers (like James Tauber). Let's keep our fingers crossed.

Open Scriptures talk from BibleTech 2010

Weston Ruter explains the Open Scriptures API project. Also take a look at James Tauber's talk: A New Kind of Graded Reader.

Derived classes: not so fast

For the Open Scriptures API I was trying to find an elegant way to store parsing information in a language-agnostic way in a Django model. To achieve that end, I chose to use derived classes. My thought was that we could use a generic class (TokenParsing) as an abstraction layer through which the language-specific parsing models could be connected to our metadata class (TokenMeta). Those parsings could then be retrieved using this bit of magic:

TokenParsing.objects.filter(tokenmeta=self)

I thought that this would return a list of all the objects subclassing TokenParsing that were linked to the metadata object. However, when objects of a derived class are created in Python and saved through Django, two records are actually created: one for the base class and one for the derived class. So this query returns only the base class objects, which have no useful information in them. In other words, the derived, language-specific parsing class (e.g. TokenParsing_grc) can have a link to the TokenMeta object, but there is no way to traverse that link backwards from the TokenMeta object. Without that, the API would not be able to query metadata properly.

I poked around at other possible methods for achieving this (e.g. abstract base classes), but none of them would achieve my goal. So I have decided to revert to the original method. But as it turns out, that is not a bad thing. In my quest to find a way to express parsings in a language-agnostic way, I overlooked one glaring problem: a language-agnostic solution would be useless. This came home when I was searching for a solution to this problem.
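The two-record behaviour described above can be pictured at the database level. The sketch below uses Python's sqlite3 module rather than Django itself, and assumes the derived classes used Django's multi-table inheritance, where the derived table points back at the base table through a pointer column. The table layout and the tense and voice columns are hypothetical, chosen only to illustrate the shape of the problem.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE token_meta (id INTEGER PRIMARY KEY);
    -- Base table: holds only the shared columns.
    CREATE TABLE token_parsing (
        id INTEGER PRIMARY KEY,
        tokenmeta_id INTEGER REFERENCES token_meta(id)
    );
    -- Derived table: language-specific columns plus a link to the base row.
    CREATE TABLE token_parsing_grc (
        tokenparsing_ptr_id INTEGER PRIMARY KEY REFERENCES token_parsing(id),
        tense TEXT,
        voice TEXT
    );
    INSERT INTO token_meta (id) VALUES (1);
    INSERT INTO token_parsing (id, tokenmeta_id) VALUES (10, 1);
    INSERT INTO token_parsing_grc VALUES (10, 'aorist', 'middle');
""")

# Querying the base table alone returns no parsing data.
base_rows = conn.execute(
    "SELECT * FROM token_parsing WHERE tokenmeta_id = ?", (1,)
).fetchall()

# Reaching the Greek columns needs a join against a known derived table.
grc_rows = conn.execute(
    """SELECT g.tense, g.voice
       FROM token_parsing AS p
       JOIN token_parsing_grc AS g ON g.tokenparsing_ptr_id = p.id
       WHERE p.tokenmeta_id = ?""",
    (1,),
).fetchall()

print(base_rows)
print(grc_rows)
```

The second query only works because the derived table is named explicitly, which is the language-specific knowledge a generic base-class query does not have.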

Morphological v. Semantic Parsing and Databases

I proposed an initial Django model for storing Greek parsing data in the Open Scriptures mailing list and it has generated a good amount of discussion. The central question is whether we should follow the traditional yet problematic morphological parsing paradigm, or whether we should seek to implement a semantic paradigm. Mike Aubrey has written some good posts on the problems with the traditional paradigm (e.g. Robertson on the middle and passive voice). Luckily with Django we can have an arbitrary number of parsing models for any given word. So from a technical standpoint, it is not a question of which model, so long as that model can be sensibly reduced to database fields. From a grammatical point of view, I have mixed feelings. I think that there are some real problems with the traditional system, especially in terms of its terminology and treatment of "tense" and voice. I think there is some value in purely morphological descriptions (especially insofar as they provide an objective description of the word), but that should not be the end-all of understanding a word. And I tend to agree that the introduction of a new technology paradigm (i.e. the Open Scriptures API) may be a good time to introduce new parsing paradigms. Still, most people who have learned Greek are rooted in the traditional paradigm, so Open Scriptures should contain parsing information they can understand. Also, there are many existing datasets using the traditional paradigm which we would like to import and utilize. So I think it best for Open Scriptures to be able to store the morphological parsings (though not to the exclusion of other paradigms).
