Discovering IIIF Manifests with Pictor

Idea

Discovering IIIF resources can be challenging.

Although the protocol does specify a dedicated Discovery API, it is not often implemented by institutions. (At Anet we are guilty of the same.) Moreover, this API has no straightforward way to obtain a full collection. It is certainly not as straightforward as OAI-PMH, for instance, which offers the verb ListIdentifiers.

The IIIF documentation does have an interesting Guide to finding IIIF resources, which features a list of IIIF collections. Similar sources are:

IIIF Discovery Registry
Biblissima IIIF Collections - Manuscripts & Rare Books
Inventory of IIIF map collections
Digitized Medieval Manuscript Database (which allows for filtering by IIIF compliance)
The IIIF Universe

With that information I was able to scrape several of these collections and aggregate them into a corpus of about 6.5 million IIIF manifests. The resulting lists are available in this repository.

This repository has two purposes. First, it offers a place to store IIIF collections and make them available to others. Second, it uses those collections to host a discovery tool with a sample of them.

Currently, it features manifests of the following institutions / collections:

Allmaps
Anet library network
Art Institute of Chicago
Badische Landesbibliothek Karlsruhe
Bayerische Staatsbibliothek (BSB) / Munich Digitization Centre (MDZ)
Biblioteca Apostolica Vaticana
Biblioteca Digital de Cuba
Biblioteca Nacional de Portugal - Biblioteca Nacional Digital
Biblioteca Virtual of the Banco de la República de Colombia
BVMM (IRHT-CNRS) (* incomplete sample, due to connectivity issues)
Digital Bodleian
Digital Collections (Leiden University Libraries)
Digital Commonwealth
Digitales Brandenburg - University Library Potsdam
Digitale Sammlungen Universität Bremen
Digitale Sammlungen Universität Bonn
Digitale Sammlungen Universität Düsseldorf
Digitale Sammlungen Universität Frankfurt
Diözesan- und Dombibliothek Köln
E-codices
The Frick Collection
Getty Institute
Göttinger Digitalisierungszentrum
Gouda Time Machine
Harvard Art Museums
Iberoamerikanisches Institut Berlin
IIIF Universe
Internet Archive
Universitätsbibliothek Heidelberg
Landesbibliothek Oldenburg Digital
Manuscriptorium
Metropolitan Museum of Art Publications
Mmmonk
Museum-digital
National Archives of Sweden
Patrimonio Digital Complutense
Parker Library On the Web (Cambridge)
Princeton University Library
Scholastic Commentaries and Texts Archive
Staatsbibliothek Berlin
Universitätsbibliothek Leipzig
Universität Halle
University College Dublin Digital Library
University of Toronto
Villanova Digital Library
Wikidata
Wellcome Collection
World Digital Library
Yale Center for British Art
Yale Peabody Museum of Natural History (* incomplete in repository, due to lack of time)
Yale University Art Gallery
Zeitungsportal NRW

(* = No sample in the discovery tool yet)

Harvesting

Harvesting the IIIF manifests was done with Python scripts in a variety of ways.

Many institutions, like the Bayerische Staatsbibliothek or the University of Toronto, let you scrape collections from their Presentation API. Others, like Digital Commonwealth, have an OAI-PMH endpoint that gets you the necessary identifiers. Still others, like the Getty Institute or Wikidata, offer a SPARQL endpoint.
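To illustrate the Presentation API route, here is a minimal Python sketch, not the actual harvesting script. It assumes a collection document has already been downloaded and parsed into a dict, and it separates the manifest URLs from the sub-collection URLs that still need to be fetched. The function name `split_collection` is hypothetical. The sketch accepts both the version 2 keys (`manifests`, `collections`, `members`, `@id`, `@type`) and the version 3 keys (`items`, `id`, `type`).

```python
def split_collection(collection):
    """Return (manifest_urls, subcollection_urls) for one parsed collection.

    Hypothetical helper: it only looks at the direct children of the
    document it is given; fetching sub-collections is left to the caller.
    """
    manifests, subcollections = [], []
    # v3 lists children under 'items'; v2 uses 'manifests', 'collections'
    # and (in 2.1) 'members'.
    for key in ("items", "members", "manifests", "collections"):
        for child in collection.get(key) or []:
            if not isinstance(child, dict):
                continue  # tolerate malformed entries
            rid = child.get("id") or child.get("@id")
            rtype = child.get("type") or child.get("@type") or ""
            if not isinstance(rid, str):
                continue
            # v2 types look like 'sc:Manifest'; keep the part after the prefix.
            kind = rtype.split(":")[-1] if isinstance(rtype, str) else ""
            if key == "manifests" or kind == "Manifest":
                manifests.append(rid)
            elif key == "collections" or kind == "Collection":
                subcollections.append(rid)
    return manifests, subcollections
```

A harvester would call this on the top-level collection, append the manifest URLs to a text file, and repeat for every sub-collection URL until none are left.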

I harvested all manifests I could find for the repository and also made random samples of 1,000 manifests per collection for the discovery tool. For this, good old Unix tools still work amazingly well:

sort digitalcommonwealth.txt | uniq | shuf | head -n 5000 > digitalcommonwealth_sample.txt

(Update 24 November 2022: at this point in the project, I was forced to switch from 5K to 1K samples from these collections because of the sheer volume of the material. As a result, the discovery tool's results are currently somewhat skewed, because some collections are better represented than others. I plan to remedy this in the future with a full re-run, but it will be a while before I get round to it.)

Indexing

Requesting and indexing the IIIF manifests was done with a Go program (since Go is really strong at concurrency), and the result was piped as triple statements from stdout to a plain text file. (I found this a handy alternative to setting up a database like PostgreSQL or something similar that could handle concurrent writes.)
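The pattern described above (many concurrent requests, one sequential stream of triples) can be sketched as follows. This is an illustration in Python, not the Go indexer itself. The tab-separated subject/predicate/object layout, the `label` predicate and the `fetch_label` callback are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def triple_lines(urls, fetch_label, workers=8):
    """Yield one 'subject<TAB>predicate<TAB>object' line per manifest.

    fetch_label is a caller-supplied function (hypothetical here) that
    downloads a manifest URL and returns its label as a string.
    """

    def safe_fetch(url):
        try:
            return url, fetch_label(url)
        except Exception:  # timeouts, refused connections, bad JSON, ...
            return url, None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The requests run concurrently, but results are consumed here in a
        # single thread, so output lines never interleave.
        for url, label in pool.map(safe_fetch, urls):
            if label is not None:
                # Collapse whitespace so a label cannot break the line format.
                yield f"{url}\tlabel\t{' '.join(label.split())}\n"
```

Writing the lines out is then a matter of `sys.stdout.writelines(triple_lines(urls, fetch_label))` and redirecting stdout to a file, which is what keeps a concurrent-write database out of the picture.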

The resulting triple store was then turned into a number of JSON files, including one for the IIIF manifest identifiers and their matching sequential number. I used base-85 numbers for the latter, as this gave me a compact way to write large numbers as short strings.
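As an illustration of why base 85 is compact, here is a small Python sketch of integer-to-string encoding. The digit alphabet is an assumption (it is the 85-character set also used by Python's `base64.b85encode`); the alphabet Pictor actually uses may differ.

```python
import string

# Assumed alphabet: 10 digits + 52 letters + 23 punctuation marks = 85 symbols.
# None of them needs escaping inside a JSON string.
ALPHABET = (
    string.digits
    + string.ascii_uppercase
    + string.ascii_lowercase
    + "!#$%&()*+-;<=>?@^_`{|}~"
)


def encode85(n):
    """Encode a non-negative integer as a base-85 string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 85)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def decode85(text):
    """Decode a base-85 string produced by encode85 back into an integer."""
    n = 0
    for ch in text:
        n = n * 85 + ALPHABET.index(ch)
    return n
```

With this alphabet, the sequential number 80000 becomes the three-character string `B6F`, and any number below 85^3 = 614,125 fits in three characters.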

In total, this process only takes a couple of hours for the current sample of ca. 80,000 manifests.

Workflow, after harvesting and sampling into *_sample.txt files:

mv *_sample.txt ../indexer/corpus
cd indexer
./build.sh
./pictor >> db.txt
python3 jsonify.py

Web application

Finally, a web interface with some JavaScript lets you enter one or several keywords, which are then looked up in the index. The resulting matches are presented as IIIF thumbnails, together with the manifest URL and the label metadata. A random selection of keywords is also offered.

I also note that this is a completely serverless application, which hosts the necessary JSON files statically and reads them into browser memory when the page loads. Obviously, this approach has its limitations, but, as with my Ulpia project, the benefits of not having to spin up a server for this tool outweigh the disadvantages.
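The keyword lookup can be pictured as a search in an inverted index. The sketch below is written in Python for consistency with the other examples, whereas the real tool does this in JavaScript in the browser. The index layout (keyword mapped to a list of manifest numbers) and the choice to require all keywords to match are assumptions, and `SAMPLE_INDEX` is made-up data.

```python
# Hypothetical index: keyword -> base-85 manifest numbers containing it.
SAMPLE_INDEX = {
    "psalter": ["1", "3"],
    "leaf": ["3", "4"],
    "map": ["2"],
}


def search(index, query):
    """Return the sorted manifest numbers that match every keyword in query."""
    keywords = query.lower().split()
    if not keywords:
        return []
    hits = set(index.get(keywords[0], []))
    for keyword in keywords[1:]:
        # Intersect: a manifest must contain all keywords to stay a hit.
        hits &= set(index.get(keyword, []))
    return sorted(hits)
```

For example, `search(SAMPLE_INDEX, "Psalter leaf")` returns `["3"]`. Each hit would then be resolved to its manifest URL and label through the identifier file.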

Technical remarks

Not only do institutions seem to neglect IIIF discovery somewhat, several of the APIs I used also suffered from frustrating hiccups like timeouts, refused connections or faulty resumption tokens. When I first started working on this, it seemed as if some collections even actively tried to limit scraping or crawling, but to be fair, there was usually a technical issue, and several of the institutions I contacted replied in a really constructive and helpful way.
Parsing IIIF manifests (both version 2 and version 3 manifests are in current use) with Go has taught me that a lot of institutions seem to implement their own interpretation of the API rather than follow the specifications. Mandatory fields are left out, fields have different data formats (strings instead of arrays and such), and so on.
I did some experiments with SQLite as a database backend, both for this application and for the requesting/indexing phase. The first, inspired by the recent sqlite3 WASM/JS functionality, I just could not get up and running. The second, I found out, is not a viable option: even if you insert data into SQLite concurrently from goroutines, SQLite apparently serializes the writes anyway. I am not sure about this, though...
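To make the remark about inconsistent field formats concrete, here is a minimal Python sketch of a tolerant label reader. It is not the Go parser used by Pictor, and `label_strings` is a hypothetical name. It flattens the shapes a `label` takes in version 2 (string, language-tagged object, or a list of either) and version 3 (language map), plus the off-spec variants where a string appears in place of an array.

```python
def label_strings(label):
    """Flatten a IIIF label of any common shape into a list of plain strings."""
    if label is None:
        return []
    if isinstance(label, str):
        return [label]
    if isinstance(label, list):
        out = []
        for item in label:
            out.extend(label_strings(item))
        return out
    if isinstance(label, dict):
        if "@value" in label:
            # v2 language-tagged value: {"@value": "...", "@language": "en"}
            return label_strings(label["@value"])
        # v3 language map: {"en": ["..."], "none": ["..."]}
        out = []
        for values in label.values():
            out.extend(label_strings(values))
        return out
    # Anything else (numbers, booleans) is kept as text rather than dropped.
    return [str(label)]
```

Calling `label_strings(manifest.get("label"))` on a parsed manifest then yields the same kind of result whichever version or interpretation the institution serves.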

Wild plan and call to action

Finally, some daydreaming. I made the discovery tool for a sample of the manifests I have collected, but what I would really like to do is push the limits and see how many manifests I can process and still host the index on a static webpage. Currently, for ca. 80,000 manifests, the JSON files are only slightly above 25 MB in total, so this could definitely be scaled up.

So if you or your institution want to participate in this experiment, or simply deposit your IIIF manifests in the central repository, please get in touch with me.

Acknowledgements

Since first publishing this project, many people have reached out with kind comments and useful suggestions. As a result, Pictor has become a better and more comprehensive tool!

Special thanks go to Etienne Posthumus, Bob Coret, Alexander Winkler, Glen Robson, Jules Schoonman, Johannes Baiter, Eduardo Fernández, Mek, Jörg Lehmann, Jolan Wuyts and anyone else I might forget...
