Use JavaScript for PDF conversion and OCR #80
Comments
Interesting idea. Since the Zotero PDF viewer is based on pdf.js (as far as I understand), does this mean we're actually getting part of the pipeline out of the box?
Yes, I think so (I had not noticed that pdf.js is already part of Zotero, so the suggested first step would not increase the size of zotero-ocr). The code of pdf2text-ocr should show the required steps to get the input for Tesseract from a PDF file.
On my MacBook |
claude.ai suggests this code:
Here's how the code works:
You can use this function in your Zotero plugin to extract the image data for each page of a scanned PDF file. Remember that this code assumes you have access to the PDF file URL. If you need to work with a PDF file stored in the Zotero user's library, you'll need to use the appropriate Zotero APIs to retrieve the file path or URL.
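For illustration, here is a minimal sketch of that page-rendering step. It is not the code from the comment above. It assumes a pdf.js module object is passed in as `pdfjsLib` and that the caller supplies a hypothetical `createCanvas(width, height)` helper; how either would be obtained inside Zotero is left open. The function name `renderPdfPages` and the default `scale` are also just placeholders.

```javascript
// Sketch only: `pdfjsLib` and `createCanvas` are supplied by the caller,
// because how to get them inside a Zotero plugin is an assumption here.
async function renderPdfPages(pdfjsLib, source, createCanvas, scale = 2.0) {
  // getDocument() returns a loading task; its promise resolves to the document.
  const pdf = await pdfjsLib.getDocument(source).promise;
  const pages = [];

  // pdf.js page numbers start at 1.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    // A larger scale gives more pixels per page, which usually helps OCR.
    const viewport = page.getViewport({ scale });

    const canvas = createCanvas(Math.ceil(viewport.width), Math.ceil(viewport.height));
    const canvasContext = canvas.getContext('2d');

    // Render the page into the canvas and wait until rendering has finished.
    await page.render({ canvasContext, viewport }).promise;

    // Collect the raw pixels of the rendered page.
    const imageData = canvasContext.getImageData(0, 0, canvas.width, canvas.height);
    pages.push({ pageNumber, imageData });
  }
  return pages;
}
```

In a browser-like context, `createCanvas` could be something like `(w, h) => Object.assign(document.createElement('canvas'), { width: w, height: h })`. Passing the dependencies in keeps the sketch independent of how pdf.js is bundled.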
Currently zotero-ocr requires additional installation steps for pdftoppm and tesseract. Both could be replaced by pure JavaScript implementations which could be included in zotero-ocr to simplify the installation:
In a first step we could start with pdf.js.
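As a rough sketch of what the later OCR step might look like once page images are available: the function below assumes a `worker` object that behaves like a tesseract.js worker, meaning `recognize(image)` resolves to an object with the recognized text in `data.text`. Using tesseract.js as the Tesseract replacement is an assumption on my side, and `ocrPages` and the `{ pageNumber, image }` page shape are made-up names.

```javascript
// Sketch only: `worker` is assumed to expose recognize(image) and to resolve
// to { data: { text } }, as a tesseract.js worker does.
async function ocrPages(worker, pages) {
  const texts = [];
  // Process pages one after another so that a single worker is never
  // asked to recognize two images at the same time.
  for (const { pageNumber, image } of pages) {
    const result = await worker.recognize(image);
    texts.push({ pageNumber, text: result.data.text });
  }
  return texts;
}
```

Each `image` would be whatever the worker accepts, for example a rendered canvas or a Blob. Creating and terminating the worker is left to the caller, so it can be reused across several PDFs.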