A tool to automatically summarize documents (or plain text) using either the BART or PreSumm machine learning model.
BART (BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension) is the state-of-the-art in text summarization as of 02/02/2020. It is a "sequence-to-sequence model trained with denoising as pretraining objective" (Documentation & Examples).
PreSumm (Text Summarization with Pretrained Encoders) applies BERT (Bidirectional Encoder Representations from Transformers) to text summarization by using "a novel document-level encoder based on BERT which is able to express the semantics of a document and obtain representations for its sentences." BERT represented "the latest incarnation of pretrained language models which have recently advanced a wide range of natural language processing tasks" at the time of writing (Documentation & Examples).
- Convert a PDF to XML and then interpret that XML file using the `font` property of each `text` element using main.py. Utilizes the xml.etree.ElementTree Python library.
- Summarize raw text input using cmd_summarizer.py. You can also run this in Google Colaboratory.
- Summarize multiple text files using presumm/run_summarization.py
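As a rough illustration of the first item, the sketch below reads XML shaped like the output of pdftohtml and counts how often each `font` value appears on the `text` elements, keeping one sample line per font. This is not the code in main.py or xml_processor.py; the helper name `font_usage` and the sample document are hypothetical, and the sample only assumes the `text`/`font` layout described above.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, hand-written sample in the shape pdftohtml -xml produces.
SAMPLE_XML = """<pdf2xml>
  <page number="1">
    <fontspec id="0" size="24" family="Times" color="#000000"/>
    <fontspec id="1" size="16" family="Times" color="#000000"/>
    <fontspec id="2" size="11" family="Times" color="#000000"/>
    <text top="100" left="80" width="300" height="28" font="0">Chapter 1</text>
    <text top="150" left="80" width="200" height="20" font="1">Background</text>
    <text top="180" left="80" width="600" height="14" font="2">First line of body text.</text>
    <text top="196" left="80" width="600" height="14" font="2">Second line of <i>body</i> text.</text>
  </page>
</pdf2xml>"""


def font_usage(xml_string):
    """Return (counts, samples): lines per font id and one sample line per font id."""
    root = ET.fromstring(xml_string)
    counts = Counter()
    samples = {}
    for text_el in root.iter("text"):
        font = text_el.get("font")
        # itertext() also collects text nested inside <i> or <b> children.
        content = "".join(text_el.itertext()).strip()
        counts[font] += 1
        samples.setdefault(font, content)
    return counts, samples


counts, samples = font_usage(SAMPLE_XML)
for font, count in sorted(counts.items()):
    print(f"font {font}: {count} line(s), e.g. {samples[font]!r}")
```

Listing fonts this way is one possible method for working out which font numbers correspond to chapter titles, headings, and body text before running main.py.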
These instructions will get you a copy of the project up and running on your local machine.
sudo apt install poppler-utils
git clone https://github.com/HHousen/docsum.git
cd docsum
conda env create --file environment.yml
conda activate docsum
pdftohtml input.pdf -i -s -c -xml output.xml
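If you would rather drive the conversion from Python, the following is a minimal sketch that wraps the same pdftohtml command with `subprocess`. The function names are hypothetical and are not part of DocSum; the flag descriptions in the comments reflect the poppler-utils options as I understand them, so check `pdftohtml -h` on your system.

```python
import shutil
import subprocess


def build_pdftohtml_command(pdf_path, xml_path):
    """Build the argument list for the pdftohtml call shown above."""
    # -i: ignore images, -s: one document for all pages,
    # -c: "complex" output mode, -xml: write XML instead of HTML.
    return ["pdftohtml", pdf_path, "-i", "-s", "-c", "-xml", xml_path]


def convert_pdf_to_xml(pdf_path, xml_path):
    """Run pdftohtml, failing early if poppler-utils is not installed."""
    if shutil.which("pdftohtml") is None:
        raise RuntimeError("pdftohtml not found; install poppler-utils first")
    # check=True raises CalledProcessError if pdftohtml exits with an error.
    subprocess.run(build_pdftohtml_command(pdf_path, xml_path), check=True)
    return xml_path
```

Calling `convert_pdf_to_xml("input.pdf", "output.xml")` would then be equivalent to the shell command, with a clearer error when the binary is missing.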
DocSum
├── bart_sum.py
├── cmd_summarizer.py
├── docsum.png
├── environment.yml
├── LICENSE
├── main.py
├── presumm
│ ├── configuration_bertabs.py
│ ├── __init__.py
│ ├── modeling_bertabs.py
│ ├── presumm.py
│ ├── run_summarization.py
│ └── utils_summarization.py
├── README.md
└── xml_processor.py
Output of python main.py --help:
usage: main.py [-h] [-t {pdf,xml}] [-m {bart,presumm}] [--bart_checkpoint PATH] [--bart_state_dict_key PATH] [--bart_fairseq] -cf N [N ...]
-bhf N [N ...] -bf N [N ...] [-ns] [--output_xml_path PATH] [-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
PATH
Summarization of PDFs using BART
positional arguments:
PATH path to input file
optional arguments:
-h, --help show this help message and exit
-t {pdf,xml}, --file_type {pdf,xml}
type of file to summarize
-m {bart,presumm}, --model {bart,presumm}
machine learning model choice
--bart_checkpoint PATH
[BART Only] Path to optional checkpoint. Semsim is better model but will use more memory and is an additional 5GB
download. (default: none, recommended: semsim)
--bart_state_dict_key PATH
[BART Only] model state_dict key to load from pickle file specified with --bart_checkpoint (default: "model")
--bart_fairseq [BART Only] Use fairseq model from torch hub instead of huggingface transformers library models. Can not use
--bart_checkpoint if this option is supplied.
-cf N [N ...], --chapter_heading_font N [N ...]
font of chapter titles
-bhf N [N ...], --body_heading_font N [N ...]
font of headings within chapter
-bf N [N ...], --body_font N [N ...]
font of body (the text you want to summarize)
-ns, --no_summarize do not run the summarization step
--output_xml_path PATH
path to output XML file if `file_type` is `pdf`
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set the logging level (default: 'Info').
Output of python cmd_summarizer.py --help
usage: cmd_summarizer.py [-h] -m {bart,presumm} [--bart_checkpoint PATH] [--bart_state_dict_key PATH] [--bart_fairseq]
[-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
Summarization of text using CMD prompt
optional arguments:
-h, --help show this help message and exit
-m {bart,presumm}, --model {bart,presumm}
machine learning model choice
--bart_checkpoint PATH
[BART Only] Path to optional checkpoint. Semsim is better model but will use more memory and is an additional 5GB
download. (default: none, recommended: semsim)
--bart_state_dict_key PATH
[BART Only] model state_dict key to load from pickle file specified with --bart_checkpoint (default: "model")
--bart_fairseq [BART Only] Use fairseq model from torch hub instead of huggingface transformers library models. Can not use
--bart_checkpoint if this option is supplied.
-l {DEBUG,INFO,WARNING,ERROR,CRITICAL}, --log {DEBUG,INFO,WARNING,ERROR,CRITICAL}
Set the logging level (default: 'Info').
Output of python -m presumm.run_summarization --help
usage: run_summarization.py [-h] --documents_dir DOCUMENTS_DIR [--summaries_output_dir SUMMARIES_OUTPUT_DIR] [--compute_rouge COMPUTE_ROUGE]
[--no_cuda NO_CUDA] [--batch_size BATCH_SIZE] [--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
[--beam_size BEAM_SIZE] [--alpha ALPHA] [--block_trigram BLOCK_TRIGRAM]
optional arguments:
-h, --help show this help message and exit
--documents_dir DOCUMENTS_DIR
The folder where the documents to summarize are located.
--summaries_output_dir SUMMARIES_OUTPUT_DIR
The folder in wich the summaries should be written. Defaults to the folder where the documents are
--compute_rouge COMPUTE_ROUGE
Compute the ROUGE metrics during evaluation. Only available for the CNN/DailyMail dataset.
--no_cuda NO_CUDA Whether to force the execution on CPU.
--batch_size BATCH_SIZE
Batch size per GPU/CPU for training.
--min_length MIN_LENGTH
Minimum number of tokens for the summaries.
--max_length MAX_LENGTH
Maixmum number of tokens for the summaries.
--beam_size BEAM_SIZE
The number of beams to start with for each example.
--alpha ALPHA The value of alpha for the length penalty in the beam search.
--block_trigram BLOCK_TRIGRAM
Whether to block the existence of repeating trigrams in the text generated by beam search.
`--file_type pdf` is only available on Linux and requires `poppler-utils` to be installed.
PDFs must be formatted in a specific way for this program to function. This program works with two levels of headings: chapter headings and body headings. Chapter headings contain many body headings, and each body heading contains many lines of body text. If your PDF file is organized in this way and you can find unique font styles in the XML representation, then this program should work.
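To make the expected structure concrete, here is a minimal sketch of how `text` elements could be sorted into chapters, body headings, and body lines by their `font` value. It is an assumption-laden illustration and not the logic in xml_processor.py: the function name `group_by_fonts`, the returned dictionary layout, and the sample XML are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: font 0 = chapter title, 1 = body heading,
# 2 and 3 = body text (3 being an italic variant), 9 = page number.
SAMPLE_XML = """<pdf2xml>
  <page number="1">
    <text top="100" left="80" font="0">Chapter 1</text>
    <text top="150" left="80" font="1">Background</text>
    <text top="180" left="80" font="2">First line of body text.</text>
    <text top="196" left="80" font="3"><i>An italic line of body text.</i></text>
    <text top="900" left="450" font="9">1</text>
  </page>
  <page number="2">
    <text top="100" left="80" font="0">Chapter 2</text>
    <text top="150" left="80" font="1">Methods</text>
    <text top="180" left="80" font="2">Body text of the second chapter.</text>
  </page>
</pdf2xml>"""


def group_by_fonts(xml_string, chapter_fonts, heading_fonts, body_fonts):
    """Group text lines into chapters -> sections -> lines using font ids."""
    # The font attribute is a string in the XML, so compare as strings.
    chapter_fonts = {str(f) for f in chapter_fonts}
    heading_fonts = {str(f) for f in heading_fonts}
    body_fonts = {str(f) for f in body_fonts}

    chapters = []
    for el in ET.fromstring(xml_string).iter("text"):
        font = el.get("font")
        content = "".join(el.itertext()).strip()
        if not content:
            continue
        if font in chapter_fonts:
            chapters.append({"title": content, "sections": []})
        elif font in heading_fonts:
            if not chapters:  # heading before any chapter title
                chapters.append({"title": None, "sections": []})
            chapters[-1]["sections"].append({"heading": content, "lines": []})
        elif font in body_fonts:
            if not chapters:
                chapters.append({"title": None, "sections": []})
            if not chapters[-1]["sections"]:  # body text before any heading
                chapters[-1]["sections"].append({"heading": None, "lines": []})
            chapters[-1]["sections"][-1]["lines"].append(content)
        # Any other font (for example page numbers) is ignored.
    return chapters


chapters = group_by_fonts(SAMPLE_XML, [0], [1], [2, 3])
for chapter in chapters:
    print(chapter["title"])
    for section in chapter["sections"]:
        print("  ", section["heading"], "->", len(section["lines"]), "line(s)")
```

Text in fonts that are not listed is skipped, which is why each level needs a font style that nothing else in the PDF shares.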
Sometimes italics or other stylistic fonts may be represented by separate font numbers. If this is the case, simply run the command and pass in multiple font styles: `python main.py book.xml -cf 5 50 -bhf 23 34 60 -bf 11 132`.
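The `N [N ...]` notation in the help text means each font option accepts one or more values. The sketch below shows how an `argparse` parser with `nargs="+"` would read the command above. It mirrors only the option names from the help output; parsing the values as integers is an assumption, and this is not the actual parser from main.py.

```python
import argparse


def build_parser():
    """Hypothetical parser covering only PATH and the three font options."""
    parser = argparse.ArgumentParser(description="Font option sketch")
    parser.add_argument("path", metavar="PATH", help="path to input file")
    # nargs="+" collects one or more values into a list.
    parser.add_argument("-cf", "--chapter_heading_font", type=int, nargs="+",
                        metavar="N", required=True, help="font of chapter titles")
    parser.add_argument("-bhf", "--body_heading_font", type=int, nargs="+",
                        metavar="N", required=True, help="font of headings within chapter")
    parser.add_argument("-bf", "--body_font", type=int, nargs="+",
                        metavar="N", required=True, help="font of body")
    return parser


args = build_parser().parse_args(
    "book.xml -cf 5 50 -bhf 23 34 60 -bf 11 132".split()
)
print(args.path)                  # book.xml
print(args.chapter_heading_font)  # [5, 50]
print(args.body_heading_font)     # [23, 34, 60]
print(args.body_font)             # [11, 132]
```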
Hayden Housen – haydenhousen.com
Distributed under the GPLv3 license. See the LICENSE for more information.
The PreSumm code is extensively borrowed from the Hugging Face Transformers library.
All Pull Requests are greatly welcomed.
Questions? Comments? Issues? Don't hesitate to open an issue and briefly describe what you are experiencing (with any error logs if necessary). Thanks.
- Fork it (https://github.com/HHousen/docsum/fork)
- Create your feature branch (`git checkout -b feature/fooBar`)
- Commit your changes (`git commit -am 'Add some fooBar'`)
- Push to the branch (`git push origin feature/fooBar`)
- Create a new Pull Request
- Make DocSum more robust to different PDF types (multi-layered headings)
- Implement other summarization techniques
- Implement automatic header detection (Possibly this paper)