Content Coordinate System

Caveat lector: HC SVNT DRACONES

Addressing Content

Suppose a document (e.g., a txt file) is stored in the Text Repository. How can we then refer to portions of the text inside that document?

Let’s try some URIs.

Suppose that:

curl $prefix/documents/$id

will get you all of the document’s txt contents. It would be nice if we could address a character range, e.g., characters 10 through 15 (inclusive), in a manner such as:

curl "$prefix/documents/$id/text/chars?start=10&end=15"

or:

curl "$prefix/documents/$id/text/chars?start=10&length=6"

Similarly, we could be interested in lines 2 through 4, at the address:

curl "$prefix/documents/$id/text/lines?start=2&end=4"
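
The arithmetic behind these addresses can be sketched in a few lines of Python. This is only an illustration: the function names and the zero-based numbering of characters and lines are assumptions, not part of the Text Repository. The one thing taken from the examples above is that end is inclusive, since characters 10 through 15 correspond to length=6.

```python
from typing import Optional


def select_chars(text: str, start: int, end: Optional[int] = None,
                 length: Optional[int] = None) -> str:
    """Return a character range, given either an inclusive end or a length."""
    if (end is None) == (length is None):
        raise ValueError("give exactly one of 'end' or 'length'")
    if length is None:
        length = end - start + 1  # inclusive end: 10..15 spans 6 characters
    return text[start:start + length]


def select_lines(text: str, start: int, end: int) -> str:
    """Return lines start..end (inclusive), keeping their line endings."""
    lines = text.splitlines(keepends=True)
    return "".join(lines[start:end + 1])
```

Under these assumptions, select_chars(text, 10, end=15) and select_chars(text, 10, length=6) return the same six characters, which is the equivalence the two chars URIs above rely on.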

The idea is that the text part in $prefix/documents/$id/text/chars?start=10&length=6 signifies that we want to interpret the document from a txt perspective, looking at lines, words, characters, etc. (as opposed to, e.g., an XML / TEI context, where we could be interested in getting the author from the metadata, …).

Things get interesting once we decide that we want to view a document from this text perspective irrespective of the actual format that was used to upload it. So, even if a TEI (or PageXML, hOCR, …) document is uploaded, we still want to address its textual content via /text/chars/..., and we consider it the Text Repository’s responsibility to yield (in this case) the requested character range (or lines, words, …) from the TEI document.

Perspectives

In $prefix/documents/$id/text/{chars,lines,words,...} we will call text the perspective.

TODO: conjure up terminology for the {chars,lines,words} part, perhaps selector?

$prefix/documents/$id/<perspective>/<selector>?params
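
To make the scheme concrete, here is a minimal sketch of how such an address could be split into its parts, using only Python's standard urllib.parse. The Address type, the parse_address function and the example host are hypothetical names chosen for illustration. The sketch assumes that $prefix itself contains no path segment named documents.

```python
from typing import Dict, NamedTuple
from urllib.parse import parse_qsl, urlsplit


class Address(NamedTuple):
    document_id: str
    perspective: str
    selector: str
    params: Dict[str, str]


def parse_address(uri: str) -> Address:
    """Split .../documents/<id>/<perspective>/<selector>?params into parts."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    # Locate the 'documents' segment so that any $prefix path is skipped.
    i = segments.index("documents")
    rest = segments[i + 1:]
    if len(rest) != 3:
        raise ValueError(f"expected <id>/<perspective>/<selector>, got {rest}")
    document_id, perspective, selector = rest
    # Query values stay strings here; converting them is left to the selector.
    return Address(document_id, perspective, selector,
                   dict(parse_qsl(parts.query)))
```

For example, parse_address("http://localhost:8080/textrepo/documents/abc123/text/chars?start=10&length=6") yields the perspective text, the selector chars and the params start and length.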

Sidestepping to the current Text Repository implementation: perspective could be a Resource class in the WebApp stack that translates the URI / addressing scheme. A separate Perspective class hierarchy is then responsible for mapping the various file formats to text (flatten the tree and yield the characters for a generic XML file, do something more intelligent for a TEI file, use “Rutger’s” implementation for PageXML/hOCR, etc.). Finally, a Selector class hierarchy works in the txt domain and selects the requested fragment.
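
The division of labour between the two hierarchies could look roughly like the following Python sketch. It is not the actual implementation: all class names, the format keys and the resolve function are hypothetical, the selectors assume zero-based positions with an inclusive end, and only the plain-text and generic-XML cases are shown (the XML case flattens the tree with the standard library's ElementTree).

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod


class Perspective(ABC):
    """Maps the stored bytes of one file format onto the text view."""

    @abstractmethod
    def render(self, contents: bytes) -> str: ...


class PlainTextPerspective(Perspective):
    def render(self, contents: bytes) -> str:
        return contents.decode("utf-8")  # assumption: uploads are UTF-8


class GenericXmlTextPerspective(Perspective):
    def render(self, contents: bytes) -> str:
        # Flatten the tree: concatenate all text nodes in document order.
        return "".join(ET.fromstring(contents).itertext())


class Selector(ABC):
    """Works purely in the txt domain, on already rendered text."""

    @abstractmethod
    def select(self, text: str, start: int, end: int) -> str: ...


class CharSelector(Selector):
    def select(self, text: str, start: int, end: int) -> str:
        return text[start:end + 1]


class LineSelector(Selector):
    def select(self, text: str, start: int, end: int) -> str:
        return "".join(text.splitlines(keepends=True)[start:end + 1])


# Hypothetical registries, keyed by file format and by selector name.
TEXT_PERSPECTIVES = {"txt": PlainTextPerspective(),
                     "xml": GenericXmlTextPerspective()}
SELECTORS = {"chars": CharSelector(), "lines": LineSelector()}


def resolve(contents: bytes, file_format: str, selector: str,
            start: int, end: int) -> str:
    """First map the format to text, then select the requested fragment."""
    text = TEXT_PERSPECTIVES[file_format].render(contents)
    return SELECTORS[selector].select(text, start, end)
```

The point of the split is that a selector never needs to know which format was uploaded: supporting TEI or PageXML would mean adding one Perspective subclass, while every existing selector keeps working unchanged.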

TODO: diagrams :-)