
metadata prep fails #55

Open
chirila opened this issue Nov 3, 2019 · 1 comment
Labels
bug (possible software bug or change)

Comments

@chirila (Collaborator) commented Nov 3, 2019

This file does not exist; I don't know why it should be listed in the keys (nothing has been deleted, to my knowledge, since /output was created):


```
KeyError                                  Traceback (most recent call last)
in ()
      6     bn = os.path.basename(i)
      7     parent_bn = '-'.join(bn.split('-')[:-1]) + '.jpg'
----> 8     fig_to_meta[bn] = deepcopy(img_to_meta[parent_bn])
      9     fig_to_meta[bn].update({
     10         'image': bn,

KeyError: '6cfbdefa-f02d-11e9-a1e4-a0999b1b3fb3.jpg'
```

@chirila added the bug label Nov 3, 2019
@duhaime (Member) commented Nov 4, 2019

Interesting; this is a little tricky to diagnose because there are a few moving parts in this system. From what I can tell, though, that image is present at `~/cc/voynich/data/morgan/images/6cfbdefa-f02d-11e9-a1e4-a0999b1b3fb3.jpg`, but it seems it is not present in `img_to_meta` (hence the KeyError).
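
For instance, a quick check in a fresh cell (assuming the earlier cells that build `img_to_meta` have already run) would confirm that:

```python
key = '6cfbdefa-f02d-11e9-a1e4-a0999b1b3fb3.jpg'
print(key in img_to_meta)                    # expected to print False, given the traceback
print(len(img_to_meta), 'images currently have metadata entries')
```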

There are a few reasons why that image might be missing from that metadata map. This notebook assumes that all cells are run in linear order from top to bottom, and it also assumes that the disk is static aside from the operations performed by the notebook itself. If either of those assumptions doesn't hold, there could be data mapping problems like the one above.
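
One way to see whether that has happened is to compare the images on disk with the keys of `img_to_meta`. A minimal sketch, assuming the images directory mentioned above and that `img_to_meta` is keyed by image basename:

```python
import os

image_dir = os.path.expanduser('~/cc/voynich/data/morgan/images')
on_disk = {f for f in os.listdir(image_dir) if f.lower().endswith('.jpg')}
in_meta = set(img_to_meta)  # built by the notebook's earlier cells

print('on disk but missing from img_to_meta:', len(on_disk - in_meta))
print('in img_to_meta but missing from disk:', len(in_meta - on_disk))
```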

For the present, would it be alright to try one push through the pipeline with an absolutely minimal dataset? I would start by moving everything in voynich/data to some other location on disk, then just add a small sample collection to voynich/data. That should come through the pipeline just fine (I've just processed a small collection locally, and updated some of the values in the notebook to ward against wonky data). If you have any troubles with the small collection, just let me know and we'll be able to investigate much more easily than with the massive pipeline.
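
If it helps, here is a rough sketch of that setup step; the backup and sample paths below are placeholders, so substitute whatever locations are convenient:

```python
import os
import shutil

data_dir   = os.path.expanduser('~/cc/voynich/data')            # current full dataset
backup_dir = os.path.expanduser('~/cc/voynich/data-full')       # placeholder holding spot
sample_src = os.path.expanduser('~/samples/small-collection')   # placeholder small sample

shutil.move(data_dir, backup_dir)                                # set the full collection aside
os.makedirs(data_dir)
shutil.copytree(sample_src, os.path.join(data_dir, 'small-collection'))
```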

Longer term, there are a few options worth considering. Right now this notebook is pretty "open" and flexible, rather than closed and robust. That seemed more appropriate for a research-oriented task, but it may well be that a more closed and robust pipeline better suits the large scale of data we now want to process.

To transition toward a more closed system, we could do the following: it wouldn't be too difficult to refactor the voynich notebooks into one resource that partitions images into figures and creates metadata for each figure. Those outputs could then be used with the neural neighbors data pipeline, which will be more robust. The vectorization strategy of the nn pipeline is entirely isolated and can be changed in place very easily: one could, for example, use code like the convolutional autoencoder in the voynich notebook to train a custom model, persist those model weights, then load them into the vectorizer in the nn pipeline. This would let one remove most of the brittle data lookups in this notebook, such as the one that threw the error in this issue...
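
To make that last step concrete, here is a rough sketch of what the vectorizer swap might look like, assuming the autoencoder is a Keras model whose encoder half was saved from the voynich notebook; the file name `encoder.h5` and the `vectorize` signature are placeholders rather than the nn pipeline's actual API:

```python
from tensorflow import keras

# Placeholder: the encoder half of the convolutional autoencoder trained in the
# voynich notebook, saved there with something like encoder.save('encoder.h5').
encoder = keras.models.load_model('encoder.h5')

def vectorize(images):
    """Map a batch of preprocessed images (n, h, w, c) to flat vectors.

    Stands in for whatever vectorization hook the nn pipeline actually exposes.
    """
    vecs = encoder.predict(images)
    return vecs.reshape(len(images), -1)
```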
