Use JavaScript for PDF conversion and OCR #80

Open
stweil opened this issue Sep 8, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@stweil
Member

stweil commented Sep 8, 2024

Currently zotero-ocr requires additional installation steps for pdftoppm and tesseract.

Both could be replaced by pure JavaScript implementations, which could be bundled with zotero-ocr to simplify the installation.

As a first step we could start with pdf.js.

@stweil stweil added the enhancement New feature or request label Sep 8, 2024
@aborel
Collaborator

aborel commented Sep 8, 2024

Interesting idea. Since the Zotero PDF viewer is based on pdf.js (as far as I understand), does this mean we're actually getting part of the pipeline out of the box?

@stweil
Member Author

stweil commented Sep 8, 2024

Yes, I think so (I had not noticed that pdf.js is already part of Zotero, so the suggested first step would not increase the size of zotero-ocr). The code of pdf2text-ocr should show the required steps to get the input for Tesseract from a PDF file.

@stweil
Member Author

stweil commented Sep 8, 2024

On my MacBook /Applications/Zotero_7.0.3.app/Contents/Resources/omni.ja is a ZIP file which includes pdfjs. So you are right, Zotero already provides it.

@stweil
Member Author

stweil commented Nov 13, 2024

claude.ai suggests this code:

async function getPageImageData(pdfUrl) {
  try {
    // Load the PDF document using Zotero's PDF.js integration.
    // NOTE: Zotero.PDF.getDocument is the name suggested by claude.ai and has
    // not been verified against the Zotero API; treat it as an assumption.
    const pdf = await Zotero.PDF.getDocument(pdfUrl);
    const numPages = pdf.numPages; // plain property, no await needed

    // Iterate through each page and extract the image data
    const pageImageData = [];
    for (let pageNumber = 1; pageNumber <= numPages; pageNumber++) {
      const page = await pdf.getPage(pageNumber);
      const viewport = page.getViewport({ scale: 1 }); // synchronous in pdf.js
      const canvas = document.createElement('canvas');
      const context = canvas.getContext('2d');

      canvas.height = viewport.height;
      canvas.width = viewport.width;

      // Render the page on the canvas
      const renderContext = {
        canvasContext: context,
        viewport: viewport
      };
      // render() returns a RenderTask; wait on its promise so that the
      // canvas is fully drawn before the pixels are read
      await page.render(renderContext).promise;

      // Get the image data from the canvas
      const imageData = context.getImageData(0, 0, canvas.width, canvas.height);
      pageImageData.push(imageData);
    }

    return pageImageData;
  } catch (error) {
    console.error('Error extracting PDF image data:', error);
    throw error;
  }
}

Here's how the code works:

  1. The getPageImageData function takes a PDF file URL as input.
  2. It uses the Zotero.PDF.getDocument method to load the PDF document using Zotero's PDF.js integration.
  3. It then iterates through each page of the PDF document, rendering the page on a canvas element.
  4. For each page, it extracts the image data using the getImageData method of the canvas context.
  5. The extracted image data for each page is collected and returned as an array.
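
One detail that matters for OCR: the function above renders at scale: 1, and in pdf.js that corresponds to one pixel per PDF point, i.e. roughly 72 DPI. Tesseract is usually reported to work better at around 300 DPI, so a larger scale is probably needed. The following is a minimal sketch of that conversion; scaleForDpi and canvasSizeForPage are hypothetical helper names, not part of pdf.js or Zotero, and the sketch assumes the default PDF user unit of 1/72 inch.

```javascript
// PDF user space is measured in points; by default 72 points make one inch.
const PDF_POINTS_PER_INCH = 72;

// Hypothetical helper: convert a target resolution in DPI into the
// `scale` value that would be passed to page.getViewport({ scale }).
function scaleForDpi(targetDpi) {
  if (!Number.isFinite(targetDpi) || targetDpi <= 0) {
    throw new RangeError('targetDpi must be a positive number');
  }
  return targetDpi / PDF_POINTS_PER_INCH;
}

// Hypothetical helper: canvas size in pixels for a page whose size is
// given in points, rendered at the target resolution.
function canvasSizeForPage(widthPt, heightPt, targetDpi) {
  scaleForDpi(targetDpi); // reuse the argument validation
  return {
    width: Math.round((widthPt * targetDpi) / PDF_POINTS_PER_INCH),
    height: Math.round((heightPt * targetDpi) / PDF_POINTS_PER_INCH),
  };
}
```

With these helpers the viewport line could become page.getViewport({ scale: scaleForDpi(300) }). Note that memory use grows with the square of the scale, so rendering and recognizing one page at a time may be preferable to collecting all pages in an array.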

You can use this function in your Zotero plugin to extract the image data for each page of a scanned PDF file. The resulting pageImageData array will contain the image data for each page, which you can then process or store as needed for your plugin's functionality.

Remember that this code assumes you have access to the PDF file URL. If you need to work with a PDF file stored in the Zotero user's library, you'll need to use the appropriate Zotero APIs to retrieve the file path or URL.
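
For a PDF stored in the Zotero library, the lookup could look roughly like the sketch below. It assumes that item is a Zotero item object and relies on the item methods isAttachment() and getFilePathAsync() and the attachmentContentType property; getPdfAttachmentPath itself is a hypothetical helper name. It has not been tested inside Zotero.

```javascript
// Sketch: resolve the local file path of a PDF attachment item.
// `item` is assumed to be a Zotero item object; getPdfAttachmentPath is a
// hypothetical helper, not an existing Zotero function.
async function getPdfAttachmentPath(item) {
  // Only attachments with a PDF content type are of interest for OCR
  if (!item.isAttachment() || item.attachmentContentType !== 'application/pdf') {
    return null;
  }
  // getFilePathAsync() is expected to resolve to false if the file is missing
  const path = await item.getFilePathAsync();
  return path || null;
}
```

The function returns null both for non-PDF items and for attachments whose file is not available locally, so the caller only has to handle one "nothing to process" case.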
