Discovering IIIF resources can be challenging.
Although the protocol does specify a dedicated Discovery API, institutions rarely implement it. (At Anet we are guilty of the same.) Moreover, this API has no straightforward way to obtain a full collection. It is certainly not as straightforward as OAI-PMH, for instance, which offers the verb ListIdentifiers.
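For comparison, an OAI-PMH harvest boils down to repeating one request until no resumption token comes back. The sketch below parses a single ListIdentifiers response with Python's standard library. The sample XML and the function name `parse_list_identifiers` are illustrative assumptions, not taken from any particular endpoint or from the scripts in this repository.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def parse_list_identifiers(xml_text):
    """Return (identifiers, resumption_token) for one ListIdentifiers response."""
    root = ET.fromstring(xml_text)
    identifiers = [
        el.text.strip()
        for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)
        if el.text
    ]
    token_el = root.find(".//oai:resumptionToken", OAI_NS)
    # An absent or empty token means the list is complete.
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return identifiers, token


# Hypothetical response body, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header><identifier>oai:example.org:1</identifier></header>
    <header><identifier>oai:example.org:2</identifier></header>
    <resumptionToken>page-2</resumptionToken>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    print(parse_list_identifiers(SAMPLE))
```

A harvesting loop would keep requesting `verb=ListIdentifiers&resumptionToken=...` with the returned token until the token is `None`.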
The IIIF documentation does have an interesting Guide to finding IIIF resources, which features a list of IIIF collections. Similar sources are:
- IIIF Discovery Registry
- Biblissima IIIF Collections - Manuscripts & Rare Books
- Inventory of IIIF map collections
- Digitized Medieval Manuscript Database (which allows for filtering by IIIF compliance)
- The IIIF Universe
With that information I was able to scrape several of these collections and aggregate them into a corpus of about 6.5 million IIIF manifests. The resulting lists are available in this repository.
This repository has two purposes. First, it offers a place to store IIIF collections and make them available to others. Second, it uses those collections to host a discovery tool with a sample of them.
Currently, it features manifests of the following institutions / collections:
- Allmaps
- Anet library network
- Art Institute of Chicago
- Badische Landesbibliothek Karlsruhe
- Bayerische Staatsbibliothek (BSB) / Munich Digitization Centre (MDZ)
- Biblioteca Apostolica Vaticana
- Biblioteca Digital de Cuba
- Biblioteca Nacional de Portugal - Biblioteca Nacional Digital
- Biblioteca Virtual of the Banco de la República de Colombia
- BVMM (IRHT-CNRS) (* incomplete sample, due to connectivity issues)
- Digital Bodleian
- Digital Collections (Leiden University Libraries)
- Digital Commonwealth
- Digitales Brandenburg - University Library Potsdam
- Digitale Sammlungen Universität Bremen
- Digitale Sammlungen Universität Bonn
- Digitale Sammlungen Universität Düsseldorf
- Digitale Sammlungen Universität Frankfurt
- Diözesan- und Dombibliothek Köln
- E-codices
- The Frick Collection
- Getty Institute
- Göttinger Digitalisierungszentrum
- Gouda Time Machine
- Harvard Art Museums
- Iberoamerikanisches Institut Berlin
- IIIF Universe
- Internet Archive
- Universitätsbibliothek Heidelberg
- Landesbibliothek Oldenburg Digital
- Manuscriptorium
- Metropolitan Museum of Art Publications
- Mmmonk
- Museum-digital
- National Archives of Sweden
- Patrimonio Digital Complutense
- Parker Library On the Web (Cambridge)
- Princeton University Library
- Scholastic Commentaries and Texts Archive
- Staatsbibliothek Berlin
- Universitätsbibliothek Leipzig
- Universität Halle
- University College Dublin Digital Library
- University of Toronto
- Villanova Digital Library
- Wikidata
- Wellcome Collection
- World Digital Library
- Yale Center for British Art
- Yale Peabody Museum of Natural History (* incomplete in repository, due to lack of time)
- Yale University Art Gallery
- Zeitungsportal NRW
(* = No sample in the discovery tool yet)
Harvesting the IIIF manifests was done with Python scripts in a variety of ways.
Many institutions, like the Bayerische Staatsbibliothek or the University of Toronto, let you scrape collections from their Presentation API. Others, like Digital Commonwealth, have an OAI-PMH endpoint that gets you the necessary identifiers. Still others, like the Getty Institute or Wikidata, offer a SPARQL endpoint.
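To illustrate the Presentation API route: a IIIF collection document lists its manifests and sub-collections, so a crawler only has to separate the two and recurse. The following is a minimal sketch, not the actual harvesting script. It assumes the collection JSON has already been downloaded and parsed, and the function name `split_collection` is hypothetical. It covers the version 2 keys (`manifests`, `collections`, `members`) and the version 3 key (`items`).

```python
def _uri(entry):
    """Return the identifier of a collection entry, or None if it has none."""
    if isinstance(entry, dict):
        return entry.get("id") or entry.get("@id")
    return None


def _as_list(value):
    """Tolerate a missing key or a single object where a list is expected."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def split_collection(collection):
    """Return (manifest_urls, subcollection_urls) found in one collection document."""
    manifests, subcollections = [], []
    # Presentation API 2.x: dedicated lists per kind.
    for entry in _as_list(collection.get("manifests")):
        if _uri(entry):
            manifests.append(_uri(entry))
    for entry in _as_list(collection.get("collections")):
        if _uri(entry):
            subcollections.append(_uri(entry))
    # "members" (2.1) and "items" (3.0) mix both kinds, so check the type.
    mixed = _as_list(collection.get("members")) + _as_list(collection.get("items"))
    for entry in mixed:
        uri = _uri(entry)
        if not uri:
            continue
        kind = entry.get("type") or entry.get("@type") or ""
        if not isinstance(kind, str):
            continue
        if kind.endswith("Manifest"):
            manifests.append(uri)
        elif kind.endswith("Collection"):
            subcollections.append(uri)
    return manifests, subcollections
```

A crawler would fetch each returned sub-collection URL in turn and call the function again, collecting manifest URLs until no sub-collections remain.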
I harvested all manifests I could find for the repository and also drew random 1,000-manifest samples from the collections for the discovery tool. For this, good old Unix tools are still amazingly good:
sort digitalcommonwealth.txt | uniq | shuf | head -n 5000 > digitalcommonwealth_sample.txt
(Update 24 November 2022: at this point in the project, I was forced to switch from 5K to 1K samples from these collections, because of the sheer volume of the material. So at the moment the discovery tool's results are somewhat skewed, because some collections are more heavily represented than others. I plan to remedy this in the future with a full re-run, but it will be a while before I get round to this.)
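For readers without `shuf` at hand, the same deduplicate-and-sample step can be sketched in Python. This is an equivalent under stated assumptions, not part of the repository: the function name `sample_manifests` is hypothetical, and unlike the shell pipeline it also strips whitespace and drops blank lines.

```python
import random


def sample_manifests(lines, k, seed=None):
    """Deduplicate manifest URLs and return a random sample of at most k of them."""
    unique = sorted({line.strip() for line in lines if line.strip()})
    rng = random.Random(seed)  # pass a seed for a reproducible sample
    return rng.sample(unique, min(k, len(unique)))


if __name__ == "__main__":
    with open("digitalcommonwealth.txt", encoding="utf-8") as src:
        sample = sample_manifests(src, 1000)
    with open("digitalcommonwealth_sample.txt", "w", encoding="utf-8") as dst:
        dst.write("\n".join(sample) + "\n")
```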
Requesting and indexing the IIIF manifests was done with a Go script (since Go is really strong at concurrency), and the result was piped as triple statements from stdout to a plain text file. (I found this a handy alternative to setting up a database like PostgreSQL or something similar that could handle concurrent writes.)
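The general pattern (many concurrent requests, one sequential writer) can be sketched as follows. Note that this is a Python analogue using a thread pool, not the Go program itself. The tab-separated triple layout, the function name `write_triples` and the `fetch_label` callback are all assumptions made for illustration.

```python
import sys
from concurrent.futures import ThreadPoolExecutor


def write_triples(urls, fetch_label, out=sys.stdout, workers=8):
    """Fetch labels concurrently and write one 'url<TAB>label<TAB>value' line each.

    fetch_label is any callable that takes a manifest URL and returns its label.
    Returns the number of lines written.
    """
    urls = list(urls)

    def safe_fetch(url):
        # Timeouts and refused connections should not stop the whole run.
        try:
            return fetch_label(url)
        except Exception:
            return None

    written = 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # The requests run in worker threads, but only this loop writes,
        # so no locking or concurrent-write support is needed.
        for url, label in zip(urls, pool.map(safe_fetch, urls)):
            if label is None:
                continue
            out.write(f"{url}\tlabel\t{label}\n")
            written += 1
    return written
```

Because all writing happens in one place, redirecting stdout to a file is enough to persist the results.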
The resulting triple store was then turned into a number of JSON files, including one for the IIIF manifest identifiers and their matching sequential number. I used base-85 numbers for the latter, as this gave me a very efficient way to encode large numbers.
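To show why a base-85 representation saves space, here is a minimal sketch of encoding a sequential number in base 85. The alphabet is an assumption (it borrows the 85 characters used by Python's `base64.b85encode`); the alphabet Pictor actually uses may differ.

```python
import string

# Assumed alphabet: 85 printable characters without quotes or backslashes,
# so encoded values can sit inside JSON strings without escaping.
ALPHABET = (
    string.digits
    + string.ascii_uppercase
    + string.ascii_lowercase
    + "!#$%&()*+-;<=>?@^_`{|}~"
)
assert len(ALPHABET) == 85


def encode_b85(n):
    """Encode a non-negative integer as a base-85 string."""
    if n < 0:
        raise ValueError("only non-negative integers are supported")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, remainder = divmod(n, 85)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))


def decode_b85(text):
    """Decode a string produced by encode_b85 back into an integer."""
    n = 0
    for char in text:
        n = n * 85 + ALPHABET.index(char)
    return n
```

With this scheme, any number below 85^4 = 52,200,625 fits in at most four characters, so even 6,500,000 takes four characters instead of seven decimal digits.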
In total, this process only takes a couple of hours for the current sample of ca. 80,000 manifests.
Workflow, after harvesting and sampling into *_sample.txt files:
mv *_sample.txt ../indexer/corpus
cd indexer
./build.sh
./pictor >> db.txt
python3 jsonify.py
Finally, a web interface with some JavaScript lets you enter one or several keywords, which are then looked up in the index. The resulting matches are presented as IIIF thumbnails, together with the manifest URL and the label metadata. A random selection of keywords is also offered.
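The lookup logic can be modelled in a few lines. The sketch below is written in Python rather than the JavaScript the site uses, and it makes two assumptions that the text above does not confirm: that the index maps each lowercase keyword to a list of manifest numbers, and that several keywords are combined with AND. The function name `search` is hypothetical.

```python
def search(index, query):
    """Return the sorted manifest numbers that match every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return []
    result = None
    for keyword in keywords:
        postings = set(index.get(keyword, []))
        # Intersect the posting lists so that only manifests matching all keywords remain.
        result = postings if result is None else result & postings
    return sorted(result)


if __name__ == "__main__":
    # Hypothetical miniature index, for illustration only.
    demo_index = {"psalter": [1, 4, 9], "latin": [4, 9, 12]}
    print(search(demo_index, "Psalter latin"))
```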
I also note that this is a completely serverless application, which hosts the necessary JSON statically and reads them into the browser memory upon loading the page. Obviously, this approach has its limitations, but, as with my Ulpia project, the benefits of not having to spin up a server for this tool outweigh the disadvantages.
- Not only do institutions seem to neglect IIIF discovery somewhat, several of the APIs I used also suffered from frustrating hiccups like timeouts, refused connections or faulty resumption tokens. When I first started working on this, it seemed as if some collections even actively tried to limit scraping or crawling. To be fair, though, there was usually a technical issue behind it, and several of the institutions I contacted replied in a really constructive and helpful way.
- Parsing IIIF manifests with Go (both version 2 and version 3 manifests are in circulation) has taught me that a lot of institutions seem to implement their own interpretation of the API rather than follow the specifications. Mandatory fields are left out, fields have different data formats (strings instead of arrays and such), and so on.
- I did some experiments with SQLite as a database backend, both for this application and for the requesting/indexing phase. The first, inspired by the recent sqlite3 WASM/JS functionality, I just could not get up and running. The second turned out not to be a viable option: even if you insert data concurrently from goroutines, SQLite allows only one writer at a time, so the writes end up being serialized anyway.
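The format variations noted above for manifests (a string where an array is expected, and so on) can be absorbed by normalizing a field before indexing it. The sketch below does this for the `label` field. It is written in Python rather than the Go used by the indexer, the function name `label_strings` is hypothetical, and the shapes it handles are the ones I would expect to meet, not an exhaustive list.

```python
def label_strings(label):
    """Flatten a IIIF label in any of several shapes into a list of strings."""
    if label is None:
        return []
    if isinstance(label, str):
        # Version 2 plain string.
        return [label] if label.strip() else []
    if isinstance(label, list):
        # List of strings or of language-tagged values.
        flattened = []
        for item in label:
            flattened.extend(label_strings(item))
        return flattened
    if isinstance(label, dict):
        if "@value" in label:
            # Version 2 language-tagged value: {"@value": "...", "@language": "en"}.
            return label_strings(label["@value"])
        # Version 3 language map: {"en": ["..."], "none": ["..."]}.
        flattened = []
        for values in label.values():
            flattened.extend(label_strings(values))
        return flattened
    # Anything else (numbers, booleans) is kept as text rather than dropped.
    return [str(label)]
```

The recursion means that a non-compliant mix, such as a version 3 language map whose values are bare strings, still yields usable text.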
Finally, some daydreaming. I made the discovery tool for a sample of the manifests I have collected, but what I would really like to do is push the limits and see how many manifests I can process and still host the index on a static webpage. Currently, for ca. 80,000 manifests, the JSON files are only slightly above 25 MB in total, so this could definitely be scaled up.
So if you or your institution want to participate in this experiment, or simply deposit your IIIF manifests in the central repository, please get in touch with me.
A very similar initiative to Pictor is the Simple IIIF Discovery by the National Gallery.
Since first publishing this project, many people have reached out with kind comments and useful suggestions. As a result, Pictor has become a better and more comprehensive tool!
Special thanks go to Etienne Posthumus, Bob Coret, Alexander Winkler, Glen Robson, Jules Schoonman, Johannes Baiter, Eduardo Fernández, Mek, Jörg Lehmann, Jolan Wuyts and anyone else I might forget...