spacypdfreader

Easy PDF to text to spaCy text extraction in Python.

Documentation: https://samedwardes.github.io/spacypdfreader/

Source code: https://github.com/SamEdwardes/spacypdfreader

PyPI: https://pypi.org/project/spacypdfreader/

spaCy universe: https://spacy.io/universe/project/spacypdfreader

spacypdfreader is a Python library for extracting text from PDF documents into spaCy Doc objects. When you use spacypdfreader, the spaCy Token and Doc objects are annotated with additional information about the PDF.

The key features are:

- PDF to spaCy Doc object: Convert a PDF document directly into a spaCy Doc object.
- Custom spaCy attributes and methods:
  - token._.page_number
  - doc._.page_range
  - doc._.first_page
  - doc._.last_page
  - doc._.pdf_file_name
  - doc._.page(int)
- Multiple parsers: Select between multiple built-in PDF to text parsers or bring your own PDF to text parser.

Installation

Install spacypdfreader using pip:

pip install spacypdfreader

To also install the dependencies required for pytesseract:

pip install 'spacypdfreader[pytesseract]'

Usage

import spacy

from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")
doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)

# Get the page number of any token.
print(doc[0]._.page_number)  # 1
print(doc[-1]._.page_number)  # 4

# Get page metadata about the PDF document.
print(doc._.pdf_file_name)  # "tests/data/test_pdf_01.pdf"
print(doc._.page_range)  # (1, 4)
print(doc._.first_page)  # 1
print(doc._.last_page)  # 4

# Get all of the text from a specific PDF page.
print(doc._.page(4))  # "able to display the destination page (unless..."

What is spaCy?

spaCy is a natural language processing (NLP) library for Python. It can be used to perform a variety of NLP tasks. For more information, check out the excellent documentation at https://spacy.io.

Implementation Notes

spaCyPDFreader behaves a little differently from a typical spaCy custom component. Typically, a spaCy component receives and returns a spacy.tokens.Doc object.

spaCyPDFreader breaks this convention because the text must first be extracted from the PDF. Instead, pdf_reader takes a path to a PDF file and a spacy.Language object as parameters and returns a spacy.tokens.Doc object. This gives users an easy way to extract text from PDF files while still letting them use and customize all of the features spaCy has to offer through the spacy.Language object they pass in.

Example of a "traditional" spaCy pipeline component, negspaCy:

import spacy
from negspacy.negation import Negex

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("negex", config={"ent_types": ["PERSON", "ORG"]})
doc = nlp("She does not like Steve Jobs but likes Apple products.")

Example of spaCyPDFreader usage:

import spacy

from spacypdfreader import pdf_reader

nlp = spacy.load("en_core_web_sm")

doc = pdf_reader("tests/data/test_pdf_01.pdf", nlp)

Note that nlp.add_pipe is not used by spaCyPDFreader.
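
To make the "path in, Doc out" idea more concrete, here is a minimal sketch of how per-page text could be turned into a single Doc with page-annotated tokens. It is an illustration only and not necessarily how spacypdfreader is implemented: the helper `pages_to_doc` is hypothetical, the page strings stand in for text that a PDF parser would return, and a blank English pipeline is used so that no model download is needed. It uses spaCy's public `Token.set_extension` and `Doc.from_docs` APIs.

```python
import spacy
from spacy.tokens import Doc, Token

# Register the custom token attribute. force=True keeps the call safe if the
# extension already exists (for example, when spacypdfreader is also imported).
Token.set_extension("page_number", default=None, force=True)


def pages_to_doc(pages, nlp):
    """Hypothetical helper: build one Doc from a list of page strings.

    Each page is processed separately so that every token can be tagged
    with the (1-based) page it came from.
    """
    page_docs = []
    for number, text in enumerate(pages, start=1):
        page_doc = nlp(text)
        for token in page_doc:
            token._.page_number = number
        page_docs.append(page_doc)
    # Doc.from_docs concatenates the per-page docs into one Doc and carries
    # the custom token attribute values over to the merged Doc.
    return Doc.from_docs(page_docs)


# Assumption: a blank pipeline (tokenizer only) stands in for a trained model.
nlp = spacy.blank("en")
doc = pages_to_doc(["First page text.", "Second page text."], nlp)

print(doc[0]._.page_number)  # 1
print(doc[-1]._.page_number)  # 2
```

Because the `nlp` object is passed in rather than created inside the helper, any pipeline customization the caller has done still applies to every page, which is the same design choice described above for pdf_reader.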
