KBNLresearch/textExtractDemo

EPUB text extraction demo

Contents of this repo

Demo scripts

Each of these demo scripts iterates over all files with an .epub extension in a user-defined input directory. For each file, it extracts the text and writes it (UTF-8 encoded) to a file in a user-defined output directory. It also writes a summary file that lists the word count of each EPUB.
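
The workflow shared by all three scripts can be sketched roughly as follows. This is a minimal illustration, not the code from the scripts themselves: the function name `extract_directory`, the pluggable `extract_text` callable, the `.txt` output naming, the `summary.csv` file name and its column headers, and the whitespace-based word count are all assumptions made for the example.

```python
import csv
from pathlib import Path
from typing import Callable, Dict


def extract_directory(dir_in, dir_out,
                      extract_text: Callable[[Path], str]) -> Dict[str, int]:
    """Extract text from every .epub file in dir_in and write it to dir_out.

    extract_text is any function that takes a file path and returns the
    extracted text as a str (hypothetical hook for Tika, Textract, ...).
    """
    dir_in, dir_out = Path(dir_in), Path(dir_out)
    dir_out.mkdir(parents=True, exist_ok=True)

    counts = {}
    for epub_path in sorted(dir_in.iterdir()):
        # Assumption: the extension check is case-insensitive
        if not (epub_path.is_file() and epub_path.suffix.lower() == ".epub"):
            continue
        text = extract_text(epub_path)
        # One UTF-8 text file per EPUB (output naming is an assumption)
        out_path = dir_out / (epub_path.stem + ".txt")
        out_path.write_text(text, encoding="utf-8")
        # Word count approximated by splitting on whitespace
        counts[epub_path.name] = len(text.split())

    # Summary file with the word count for each EPUB (name is an assumption)
    with open(dir_out / "summary.csv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["fileName", "wordCount"])
        for name, word_count in counts.items():
            writer.writerow([name, word_count])
    return counts
```

Passing the extractor in as a function keeps the directory handling identical across libraries, so only the extraction call differs between the scripts.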

Tika-python script

Usage

python3 extract-tika.py [-h] [--trim] dirIn dirOut

positional arguments:

  • dirIn: directory with input EPUB files
  • dirOut: output directory

optional arguments:

  • -h, --help: show help message and exit

Example

python3 ./textExtractDemo/scripts/extract-tika.py DBNL_EPUBS_moderneromans/ out-dbnl/
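
For readers unfamiliar with tika-python, the core extraction call might look like the sketch below. It is an assumption about how such a script could use the library, not a copy of extract-tika.py; the function name `extract_with_tika` is hypothetical. Note that tika-python requires Java and talks to a Tika server, which it starts on first use.

```python
def extract_with_tika(path) -> str:
    """Return the plain text of one file using tika-python (illustrative)."""
    # Imported lazily so this module loads even without tika installed
    from tika import parser

    # from_file returns a dict with "metadata" and "content" keys
    parsed = parser.from_file(str(path))
    # "content" can be None when Tika finds no text, so fall back to ""
    return parsed.get("content") or ""
```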

Textract script

Usage

python3 extract-textract.py [-h] dirIn dirOut

positional arguments:

  • dirIn: directory with input EPUB files
  • dirOut: output directory

optional arguments:

  • -h, --help: show help message and exit

Example

python3 ./textExtractDemo/scripts/extract-textract.py DBNL_EPUBS_moderneromans/ out-dbnl/
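
As a rough illustration of the Textract approach, the extraction step could be written as below. This is a sketch under assumptions, not the contents of extract-textract.py; the function name `extract_with_textract` is hypothetical.

```python
def extract_with_textract(path) -> str:
    """Return the plain text of one file using textract (illustrative)."""
    # Imported lazily so this module loads even without textract installed
    import textract

    # process() picks a parser based on the file extension and returns bytes
    raw = textract.process(str(path))
    # Decode to str; undecodable bytes are replaced rather than raising
    return raw.decode("utf-8", errors="replace")
```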

Ebooklib script

Usage

python3 extract-ebooklib.py [-h] dirIn dirOut

positional arguments:

  • dirIn: directory with input EPUB files
  • dirOut: output directory

optional arguments:

  • -h, --help: show help message and exit

Example

python3 ./textExtractDemo/scripts/extract-ebooklib.py DBNL_EPUBS_moderneromans/ out-dbnl/
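
Unlike Tika and Textract, EbookLib returns the EPUB's content documents as (X)HTML, so the markup still has to be stripped. The sketch below shows one possible way to do that using only the standard library's HTML parser; it is an assumption for illustration, not the method used in extract-ebooklib.py, and the names `html_to_text` and `extract_with_ebooklib` are hypothetical.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, ignoring the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def html_to_text(markup: str) -> str:
    """Strip tags from an (X)HTML string and normalise whitespace."""
    collector = _TextCollector()
    collector.feed(markup)
    collector.close()
    # Joining with a space keeps adjacent blocks apart, at the cost of
    # splitting words that contain inline markup (e.g. w<b>or</b>d)
    return " ".join(" ".join(collector.chunks).split())


def extract_with_ebooklib(path) -> str:
    """Return the text of all content documents in an EPUB (illustrative)."""
    # Imported lazily so html_to_text is usable without EbookLib installed
    import ebooklib
    from ebooklib import epub

    book = epub.read_epub(str(path))
    parts = []
    for item in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
        markup = item.get_content().decode("utf-8", errors="replace")
        parts.append(html_to_text(markup))
    return "\n".join(parts)
```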
