Update documentation to install pdfium
Belval committed Jun 20, 2024
1 parent f21bf96 commit 1c0d7f2
Showing 19 changed files with 36 additions and 35 deletions.
3 changes: 2 additions & 1 deletion docs/source/installation.rst
@@ -9,7 +9,8 @@ ____________________________________

Textractor is available on PyPI and can be installed with :code:`pip install amazon-textract-textractor`. By default this will install the minimal version of textractor. The following extras can be used to add features:

- :code:`pdf` (:code:`pip install amazon-textract-textractor[pdf]`) includes :code:`pdf2image` and enables PDF rasterization in Textractor. Note that this is **not** necessary to call Textract with a PDF file.
- :code:`pdfium` (:code:`pip install amazon-textract-textractor[pdfium]`) includes :code:`pypdfium2` and is the recommended way to enable PDF rasterization in Textractor. Note that this is **not** necessary to call Textract with a PDF file.
- :code:`pdf` (:code:`pip install amazon-textract-textractor[pdf]`) includes :code:`pdf2image` and is an additional way to enable PDF rasterization in Textractor. Note that this is **not** necessary to call Textract with a PDF file.
- :code:`torch` (:code:`pip install amazon-textract-textractor[torch]`) includes :code:`sentence_transformers` for better word search and matching. This will work on CPU but be noticeably slower than non-machine learning based approaches.
- :code:`dev` (:code:`pip install amazon-textract-textractor[dev]`) includes all the dependencies above and everything else needed to test the code.
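
To confirm which PDF rasterization extra ended up in an environment, one option is to check whether the backing modules can be imported. The snippet below is a minimal sketch, not part of Textractor: the helper name `available_pdf_backends` is made up for illustration, and it only assumes that the `pdfium` and `pdf` extras provide the `pypdfium2` and `pdf2image` modules named above.

```python
import importlib.util


def available_pdf_backends():
    """Return the PDF rasterization modules that can be imported here.

    Hypothetical helper for illustration; it is not a Textractor API.
    """
    # Module names taken from the extras described above.
    candidates = ("pypdfium2", "pdf2image")
    # find_spec returns None when a top-level module is not installed.
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


if __name__ == "__main__":
    backends = available_pdf_backends()
    if backends:
        print("PDF rasterization backends found:", ", ".join(backends))
    else:
        print("No PDF rasterization backend found; consider the pdfium extra.")
```

An empty result only means local PDF rasterization is unavailable; as noted above, neither extra is needed to call Textract with a PDF file.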

@@ -15,7 +15,7 @@
"\n",
"`pip install amazon-textract-textractor`\n",
"\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"\n",
"## Calling Textract"
]
2 changes: 1 addition & 1 deletion docs/source/notebooks/going_further.ipynb
@@ -18,7 +18,7 @@
"\n",
"`pip install amazon-textract-textractor`\n",
"\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"\n",
"## Calling Textract\n",
"\n",
2 changes: 1 addition & 1 deletion docs/source/notebooks/interfacing_with_trp2.ipynb
@@ -15,7 +15,7 @@
"\n",
"`pip install amazon-textract-textractor`\n",
"\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"\n",
"## Calling Textract"
]
4 changes: 2 additions & 2 deletions docs/source/notebooks/introduction_to_searching.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/source/notebooks/layout_analysis.ipynb
@@ -15,7 +15,7 @@
"\n",
"`pip install amazon-textract-textractor`\n",
"\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"\n",
"## Calling Textract"
]
@@ -15,7 +15,7 @@
"\n",
"`pip install amazon-textract-textractor`\n",
"\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"\n",
"## Calling Textract"
]
2 changes: 1 addition & 1 deletion docs/source/notebooks/parsing_an_existing_response.ipynb
@@ -15,7 +15,7 @@
"\n",
"`pip install amazon-textract-textractor`\n",
"\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"\n",
"## Not calling Textract\n",
"\n",
6 changes: 3 additions & 3 deletions docs/source/notebooks/signature_detection.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/source/notebooks/simple_ocr.ipynb
@@ -15,7 +15,7 @@
"\n",
"`pip install amazon-textract-textractor`\n",
"\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)"
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)"
]
},
{
8 changes: 4 additions & 4 deletions docs/source/notebooks/table_data_to_various_formats.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion docs/source/notebooks/tabular_data_linearization.ipynb
@@ -15,7 +15,7 @@
"\n",
"`pip install amazon-textract-textractor`\n",
"\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"\n",
"## Calling Textract"
]
@@ -15,7 +15,7 @@
"\n",
"`pip install amazon-textract-textractor`\n",
"\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"\n",
"## Calling Textract"
]
@@ -15,7 +15,7 @@
"\n",
"`pip install amazon-textract-textractor`\n",
"\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdf]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"There are various sets of dependencies available to tailor your installation to your use case. The base package will have sensible default, but you may want to install the PDF extra dependencies if your workflow uses PDFs with `pip install amazon-textract-textractor[pdfium]`. You can read more on extra dependencies [in the documentation](https://aws-samples.github.io/amazon-textract-textractor/installation.html)\n",
"\n",
"## Calling Textract"
]
4 changes: 2 additions & 2 deletions docs/source/notebooks/using_analyze_expense.ipynb

Large diffs are not rendered by default.

6 changes: 3 additions & 3 deletions docs/source/notebooks/using_analyze_id.ipynb

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/source/notebooks/using_queries.ipynb

Large diffs are not rendered by default.

12 changes: 6 additions & 6 deletions docs/source/notebooks/visualizing_results.ipynb

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions docs/source/using_in_lambda.rst
@@ -27,7 +27,7 @@ policy. We recommend that you review your lambda function and tailor the permiss

.. image:: images/lambda_tutorial/1b.png

c. Scroll to the bottom of the page and download the package that matches your Python installation. Packages with the `-pdf` suffix contains `pdf2image` and allow you to process PDF documents.
c. Scroll to the bottom of the page and download the package that matches your Python installation. Packages with the `-pdfium` suffix contain `pypdfium2` and allow you to process PDF documents. Packages with the `-pdf` suffix contain `pdf2image` and also allow you to process PDF documents, however we recommend using `pypdfium2` as it does not require any OS-level dependencies.

.. image:: images/lambda_tutorial/1c.png

@@ -59,7 +59,7 @@ policy. We recommend that you review your lambda function and tailor the permiss

4. Update your code to use Textractor

a. If using the PDF version you have to update the `PATH` and `LD_LIBRARY_PATH` environment variables through the lambda function configuration interface or directly in code with the `os` module:
a. If using the `pdf2image` PDF version you have to update the `PATH` and `LD_LIBRARY_PATH` environment variables through the lambda function configuration interface or directly in code with the `os` module:

.. code-block:: python
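
The body of the code block for step 4a is not shown in this diff. As a rough sketch of the idea it describes (updating `PATH` and `LD_LIBRARY_PATH` in code with the `os` module), something like the following could work. The helper `prepend_to_path_var` and the two directories are hypothetical placeholders, not the values used by the Textractor layer; substitute the locations where your layer actually unpacks its binaries and shared libraries.

```python
import os


def prepend_to_path_var(name, directory, env=os.environ):
    """Prepend `directory` to the path-list variable `name` in `env`.

    Hypothetical helper for illustration; it is not part of Textractor.
    """
    current = env.get(name, "")
    # Drop empty entries so an unset variable does not yield a stray separator.
    parts = [p for p in current.split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env[name] = os.pathsep.join(parts)
    return env[name]


# Placeholder directories (assumptions): replace with your layer's real paths.
LAYER_BIN = "/opt/example/bin"
LAYER_LIB = "/opt/example/lib"

# Run this before any PDF rasterization call in the handler module.
prepend_to_path_var("PATH", LAYER_BIN)
prepend_to_path_var("LD_LIBRARY_PATH", LAYER_LIB)
```

Changes made through `os.environ` are inherited by child processes started afterwards, which is what matters here because `pdf2image` runs the poppler utilities as subprocesses. The `pypdfium2` packages should not need this step, since they do not require OS-level dependencies.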
