Content Coordinate System¶
Caveat lector: HC SVNT DRACONES
Addressing Content¶
Suppose a document (e.g., a txt
file) is in the Text Repository, how can we then refer to portions of the text inside the document?
Let’s try some URIs.
Suppose that:
curl $prefix/documents/$id
will get you all of the document’s txt
contents. It would be nice if we could address a character range, e.g.,
from characters 10-15
in a manner such as:
curl $prefix/documents/$id/text/chars?start=10&end=15
or:
curl $prefix/documents/$id/text/chars?start=10&length=6
Similarly we could be interested in lines 2..4
at address:
curl $prefix/documents/$id/text/lines?start=2&end=4
The idea being that the text
part in $prefix/documents/$id/text/chars?start=10&length=6
signifies that we
are interested in interpreting the document from a txt
perspective, looking at lines, words, characters, etc.
(as opposed to, e.g., an XML / TEI context where we could be interested in getting the author from the metadata, …)
Things get interesting once we challenge ourselves to the idea that we could be interested in
viewing a document from this text
perspective, irrespective of the actual format that was used to upload
the document. So, even if a TEI
(or PageXML
, hOCR
, …) document is uploaded, we still want
to address its textual content via /text/chars/...
and we consider it a Text Repository responsibility to be
able to (in this case) yield the requested character range (lines, words, …) from the TEI
document.
Perspectives¶
In $prefix/documents/$id/text/{chars,lines,words,...}
we will call text
the perspective.
TODO: conjure up terminology for the {chars,lines,words}
part, perhaps selector
?
$prefix/documents/$id/<perspective>/<selector>?params
Sidestepping to current Text Repository implementation, perspective
could be a Resource class in the WebApp stack,
translating the URI / addressing scheme. Then a separate Perspective class hierarchy is responsible for mapping
various file formats to text (flatten the tree and yield the characters for a generic XML file, do something
more intelligent for a TEI file, use “Rutger’s” implementation for PageXML/hOCR, etc.). Then a Selector class
hierarchy works in the txt
domain and selects the requested fragment.
TODO: diagrams :-)