Use JavaScript for PDF conversion and OCR #80

stweil · 2024-09-08T15:32:21Z

Currently zotero-ocr requires additional installation steps for pdftoppm and tesseract.

Both could be replaced by pure JavaScript implementations which could be included in zotero-ocr to simplify the installation.

As a first step, we could start with pdf.js.

aborel · 2024-09-08T15:48:41Z

Interesting idea. Since the Zotero PDF viewer is based on pdf.js (as far as I understand), does this mean we're actually getting part of the pipeline out of the box?

stweil · 2024-09-08T16:05:55Z

Yes, I think so. I had not noticed that pdf.js is already part of Zotero, so the suggested first step would not increase the size of zotero-ocr. The code of pdf2text-ocr should show the required steps to get the input for Tesseract from a PDF file.

stweil · 2024-09-08T16:28:59Z

On my MacBook, /Applications/Zotero_7.0.3.app/Contents/Resources/omni.ja is a ZIP file which includes pdfjs. So you are right, Zotero already provides it.

stweil · 2024-11-13T20:12:26Z

claude.ai suggests this code:

async function getPageImageData(pdfUrl) {
  try {
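    // Load the PDF document using Zotero's PDF.js integration
    // (Zotero.PDF.getDocument is the AI-suggested call; it has not been verified against Zotero's API)
    const pdf = await Zotero.PDF.getDocument(pdfUrl);
    const numPages = pdf.numPages; // plain property, no await needed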

    // Iterate through each page and extract the image data
    const pageImageData = [];
    for (let pageNumber = 1; pageNumber <= numPages; pageNumber++) {
      const page = await pdf.getPage(pageNumber);
      const viewport = page.getViewport({ scale: 1 }); // synchronous in pdf.js
      const canvas = document.createElement('canvas');
      const context = canvas.getContext('2d');

      canvas.height = viewport.height;
      canvas.width = viewport.width;

      // Render the page on the canvas
      const renderContext = {
        canvasContext: context,
        viewport: viewport
      };
      // render() returns a RenderTask; wait on its promise so the canvas is fully drawn
      await page.render(renderContext).promise;

      // Get the image data from the canvas
      const imageData = context.getImageData(0, 0, canvas.width, canvas.height);
      pageImageData.push(imageData);
    }

    return pageImageData;
  } catch (error) {
    console.error('Error extracting PDF image data:', error);
    throw error;
  }
}

Here's how the code works:

The getPageImageData function takes a PDF file URL as input.
It uses the Zotero.PDF.getDocument method to load the PDF document using Zotero's PDF.js integration.
It then iterates through each page of the PDF document, rendering the page on a canvas element.
For each page, it extracts the image data using the getImageData method of the canvas context.
The extracted image data for each page is collected and returned as an array.

You can use this function in your Zotero plugin to extract the image data for each page of a scanned PDF file. The resulting pageImageData array will contain the image data for each page, which you can then process or store as needed for your plugin's functionality.
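One detail worth checking before feeding the result to OCR: the function renders at `scale: 1`, which in pdf.js means one pixel per PDF point (1/72 inch), so the pages come out at roughly 72 DPI. Tesseract generally works better at around 300 DPI. A minimal sketch of how the scale could be derived from a target resolution follows; `scaleForDpi` and `pixelSize` are hypothetical helper names, not part of pdf.js or Zotero.

```javascript
// PDF user space uses 72 points per inch by default.
const PDF_POINTS_PER_INCH = 72;

// Hypothetical helper: scale factor to pass to page.getViewport({ scale })
// so that the rendered page has the requested resolution.
function scaleForDpi(dpi) {
  if (!(dpi > 0)) {
    throw new RangeError('dpi must be a positive number');
  }
  return dpi / PDF_POINTS_PER_INCH;
}

// Hypothetical helper: expected canvas size in pixels for a page whose
// size is given in PDF points.
function pixelSize(widthPt, heightPt, dpi) {
  if (!(dpi > 0)) {
    throw new RangeError('dpi must be a positive number');
  }
  return {
    width: Math.round((widthPt * dpi) / PDF_POINTS_PER_INCH),
    height: Math.round((heightPt * dpi) / PDF_POINTS_PER_INCH),
  };
}
```

With this, the viewport line could become `page.getViewport({ scale: scaleForDpi(300) })`. Note that a US Letter page (612 x 792 points) at 300 DPI is 2550 x 3300 pixels, so collecting every page in one array may use a lot of memory for long documents.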

Remember that this code assumes you have access to the PDF file URL. If you need to work with a PDF file stored in the Zotero user's library, you'll need to use the appropriate Zotero APIs to retrieve the file path or URL.
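For an attachment stored in the library, something along these lines might work. This is only a sketch: it assumes `item` is a Zotero attachment item and that `isPDFAttachment()` and `getFilePathAsync()` behave as in current Zotero versions; `getPdfPath` is a hypothetical helper name.

```javascript
// Sketch only: `item` is assumed to be a Zotero attachment item.
// isPDFAttachment() and getFilePathAsync() are assumed to exist on it.
async function getPdfPath(item) {
  if (!item.isPDFAttachment()) {
    throw new Error('Item is not a PDF attachment');
  }
  // Assumed to resolve to a falsy value when the file is missing on disk
  const path = await item.getFilePathAsync();
  if (!path) {
    throw new Error('PDF file not found on disk');
  }
  return path;
}
```

The returned path would still have to be turned into whatever input the PDF loading call expects (a URL or the file's bytes), which depends on how pdf.js is exposed inside Zotero.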

stweil added the "enhancement" (New feature or request) label · Sep 8, 2024
