Add shell script to verify output directory structure #52

Open · wants to merge 39 commits into base: main

Commits (39)
1b9bb26
Add script to verify output directory structure
thobson88 Nov 3, 2022
68798e1
Add verification directory to gitignore
thobson88 Nov 3, 2022
7456be2
Merge remote-tracking branch 'origin' into feature/verify-output-dirs
spool Nov 29, 2022
4aa4c5f
Set pyspark as optional dependency and add ipython for pytest ipdb su…
spool Nov 29, 2022
4728755
Update extract_text_common.xslt to reflect version 0.3.4
spool Nov 29, 2022
ddb771e
Add tests and -t option to include alto2txt-verify.sh in the command …
spool Nov 30, 2022
a7b5711
Merge branch 'main' into feature/verify-output-dirs
spool Nov 30, 2022
ea8a0d8
Added 'digitised' to the heading
mialondon Dec 1, 2022
155fa81
Fix typos in and rephrase documentation.
spool Dec 6, 2022
6e8f237
Add a footnote reference following #55
griff-rees Dec 6, 2022
54876e0
Merge from main.
spool Feb 16, 2023
00160c4
Merge pre-commit config/dependencies from main.
spool Feb 16, 2023
e16015d
Update poetry.lock
spool Feb 21, 2023
4ef3063
Merge main and manage difference in README.md
spool Feb 21, 2023
73efc6d
Update docs to ease merge with main and manage footnotes.
spool Feb 21, 2023
1fb962d
Merge pull request #55 from Living-with-machines/mialondon-patch-1
griff-rees Feb 21, 2023
7f1158a
Manage diff between README.md and docs/README.md
griff-rees Feb 21, 2023
548cbb8
Merge branch 'main' into doc-copy-edits
griff-rees Feb 21, 2023
264d0a1
docs: add contributing.md and update pre-commit
spool Feb 21, 2023
2cf44b4
docs: add contributing.md and enable in sidebar
spool Feb 21, 2023
29f9917
docs: Merge and fix from branch 'origin/main' into doc-copy-edits
spool Feb 21, 2023
71be695
docs: Fix merge of README.md and add $ to console examples
spool Feb 21, 2023
beb4f0b
docs: fix merge of README.md with main.
spool Feb 21, 2023
18be46e
fix: update extract_text_common.xslt version
spool Feb 21, 2023
ee3e31e
docs: expand contribution sections on pytest and update versions
spool Feb 22, 2023
ec66a8f
fix: black formatting in xml_totext.py and tests.
spool Feb 22, 2023
c92c6ab
Merge pull request #71 from Living-with-machines/doc-copy-edits
griff-rees Feb 22, 2023
d514f5a
docs: flip README.md badges and xml version order
griff-rees Feb 22, 2023
097b061
docs: harmonise shell prompt examples and rm dupe of contrib info in …
spool Feb 22, 2023
0513659
Add codecov to ci-tests.py
griff-rees Feb 22, 2023
2c83905
fix: merge changes added via pre-commit linting/docs
spool Feb 22, 2023
cb56dd7
Merge remote-tracking branch 'origin/main'
spool Feb 22, 2023
155fbab
fix: documentation links in README.md
griff-rees Feb 23, 2023
324d044
fix: poetry.lock coverage 7.1.0 -> 7.2.1
spool Feb 27, 2023
121f2a3
Merge remote-tracking branch 'origin/main'
spool Feb 27, 2023
38431bb
chore: update poetry.lock depedencies
spool Apr 14, 2023
482e5e4
chore: update poetry.lock depedencies and docs
spool Apr 14, 2023
af195d1
feat: add spark
spool Apr 14, 2023
1cc6dec
fix: test section of branch for
spool Apr 14, 2023
3 changes: 2 additions & 1 deletion .github/workflows/ci-tests.yml
@@ -38,5 +38,6 @@ jobs:
poetry run flake8 src --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
# pip install pytest
poetry run pytest
- name: Upload coverage reports to Codecov
uses: codecov/codecov-action@v3
3 changes: 3 additions & 0 deletions .gitignore
@@ -105,3 +105,6 @@ dmypy.json

# Ignore demo-output dir
demo-output/

# Ignore dir containing verification artifacts
alto2txt-verify/
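
The verification script itself is not visible in this excerpt of the diff, so the following is only a rough sketch of the idea behind it (the actual `alto2txt-verify.sh` added by this PR may work quite differently): check that the `publication/year/issue` layout of an output directory mirrors the input directory.

```bash
#!/usr/bin/env bash
# Hypothetical sketch only; not the script added in this PR.
# Checks that txt_out_dir mirrors the publication/year/issue layout of xml_in_dir.
set -euo pipefail

xml_in_dir="$1"
txt_out_dir="$2"

# Relative directory paths down to issue level (publication/year/issue).
list_issue_dirs() {
  (cd "$1" && find . -mindepth 3 -maxdepth 3 -type d | sort)
}

if diff <(list_issue_dirs "$xml_in_dir") <(list_issue_dirs "$txt_out_dir"); then
  echo "OK: output directory structure matches input."
else
  echo "ERROR: output directory structure differs from input (see diff above)." >&2
  exit 1
fi
```
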
12 changes: 6 additions & 6 deletions .pre-commit-config.yaml
@@ -2,7 +2,7 @@
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.3.0
rev: v4.4.0
hooks:
- id: trailing-whitespace
# Leave demo-files unaltered after download from BL website.
@@ -12,28 +12,28 @@ repos:
- id: check-xml
- id: check-added-large-files
- repo: https://github.com/python-poetry/poetry
rev: '1.3.2'
rev: '1.3.0'
hooks:
- id: poetry-check
- id: poetry-lock
- repo: https://github.com/psf/black
rev: 22.6.0
rev: 23.1.0
hooks:
- id: black
- repo: https://github.com/pre-commit/mirrors-autopep8
rev: v1.6.0 # Use the sha / tag you want to point at
rev: v2.0.1 # Use the sha / tag you want to point at
hooks:
- id: autopep8
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.971 # Use the sha / tag you want to point at
rev: v1.0.1 # Use the sha / tag you want to point at
hooks:
- id: mypy
- repo: https://github.com/pre-commit/mirrors-isort
rev: v5.10.1
hooks:
- id: isort
- repo: https://github.com/hadialqattan/pycln
rev: v1.2.5
rev: v2.1.3
hooks:
- id: pycln
args: [--config=pyproject.toml]
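
The hook revision bumps above are the kind of change `pre-commit autoupdate` produces. To pick up the updated hooks locally and re-run them across the repository, the usual workflow is:

```console
$ pre-commit install          # set up the git hook once per clone
$ pre-commit autoupdate       # bump hook revs, as reflected in this diff
$ pre-commit run --all-files  # run every hook against the whole repository
```
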
166 changes: 86 additions & 80 deletions README.md
@@ -1,62 +1,64 @@
# `alto2txt`: Extract plain text from newspapers
# `alto2txt`: Extract plain text from digital newspaper OCR scans

![GitHub](https://img.shields.io/github/license/Living-with-Machines/alto2txt) ![PyPI](https://img.shields.io/pypi/v/alto2txt) [![DOI](https://zenodo.org/badge/259340615.svg)](https://zenodo.org/badge/latestdoi/259340615)
![GitHub](https://img.shields.io/github/license/Living-with-Machines/alto2txt) ![PyPI](https://img.shields.io/pypi/v/alto2txt) [![DOI](https://zenodo.org/badge/259340615.svg)](https://zenodo.org/badge/latestdoi/259340615) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)

*Version extract_text 0.3.4*

`alto2txt` converts `XML` `ALTO`/`METS` Optical Character Recognition (OCR) scans into plaintext files with minimal metadata.

Converts `XML` (in `METS` `1.8`/`ALTO` `1.4`, `METS` `1.3`/`ALTO` `1.4`, `BLN` or `UKP` format) publications to plaintext articles and generates minimal metadata.

**`XML` compatibility: `METS 1.8`/`ALTO 1.4`, `METS 1.3`/`ALTO 1.4`, `BLN`, or `UKP` format**

## [Full documentation and demo instructions.](https://living-with-machines.github.io/alto2txt/#/)

`ALTO` and `METS` are industry standards maintained by the [US Library of Congress](https://www.loc.gov/librarians/standards) for newspaper digitization, used by hundreds of modern, large-scale digitization projects. One text file is output per article, each complemented by one `XML` metadata file[^1].

## Installation

### Installation using an Anaconda environment

We recommend installation via Anaconda:
[`METS` (Metadata Encoding and Transmission Standard)](http://www.loc.gov/standards/mets/) is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed in `XML`. [`ALTO` (Analyzed Layout and Text Objects)](https://www.loc.gov/standards/alto/) is an [`XML schema`](https://en.wikipedia.org/wiki/XML_schema) for technical metadata describing the layout and content of text resources such as book or newspaper pages. `ALTO` is often used in combination with `METS` but can also be used independently. Details of the `ALTO` schema are available at https://github.com/altoxml/schema.

* Refer to the [Anaconda website and follow the instructions](https://docs.anaconda.com/anaconda/install/).

* Create a new environment for `alto2txt`
## Quick Install

```bash
conda create -n py37alto python=3.7
```
### `pip`

* Activate the environment:
As of version `v0.3.4` `alto2txt` is available on [`PyPI`](https://pypi.org/project/alto2txt/) and can be installed via

```bash
conda activate py37alto
```console
$ pip install alto2txt
```

### Installation using pip, outside an Anaconda environment
### `conda`

Note, the use of ``alto2txt`` outside a conda environment has not been as extensively tested as within a conda environment. Whilst we believe that this should work, please use with caution.
If you are comfortable with the command line, git, and already have Python & Anaconda installed, you can install `alto2txt` by navigating to an empty directory in the terminal and running the following commands:

```bash
pip install alto2txt
```console
$ git clone https://github.com/Living-with-machines/alto2txt.git
$ cd alto2txt
$ conda create -n py37alto python=3.7
$ conda activate py37alto
$ pip install .
```

### Installation of a test release

If you need (or want) to install a test release of `alto2txt` you will likely be advised of the specific version number to install. This examaple command will install `v0.3.1-alpha.20`:
If you need (or want) to install a test release of `alto2txt` you will likely be advised of the specific version number to install. This command will install `v0.3.1-alpha.20`:

```bash
pip install -i https://test.pypi.org/simple/ alto2txt==0.3.1a20
$ pip install -i https://test.pypi.org/simple/ alto2txt==0.3.1a20
```

## Usage

Downsampling can be used to convert only every Nth issue of each newspaper. One text file is output per article, each complemented by one `XML` metadata file.

[Click here](https://living-with-machines.github.io/alto2txt/#Demo.md) for more in-depth installation instructions using demo files.

## Usage

> *Note*: the formatting below is altered for readability
```
usage: alto2txt [-h] [-p [PROCESS_TYPE]] [-l [LOG_FILE]] [-d [DOWNSAMPLE]] [-n [NUM_CORES]]
$ alto2txt -h

usage: alto2txt [-h]
[-p [PROCESS_TYPE]]
[-l [LOG_FILE]]
[-d [DOWNSAMPLE]]
[-n [NUM_CORES]]
xml_in_dir txt_out_dir
alto2txt [-h] [-p [PROCESS_TYPE]] [-l [LOG_FILE]] [-d [DOWNSAMPLE]] [-n [NUM_CORES]]
xml_in_dir txt_out_dir

Converts XML publications to plaintext articles

@@ -75,91 +77,92 @@ optional arguments:
-n [NUM_CORES], --num-cores [NUM_CORES]
Number of cores (Spark only). Default 1")
```
To read about downsampling, logs, and using spark see [Advanced Information](https://living-with-machines.github.io/alto2txt/#/advanced).

`xml_in_dir` is expected to hold `XML` for multiple publications, in the following structure:

```
xml_in_dir
|-- publication
| |-- year
| | |-- issue
| | | |-- xml_content
| |-- year
|-- publication
```

However, if `-p|--process-type single` is provided then `xml_in_dir` is expected to hold `XML` for a single publication, in the following structure:

```
xml_in_dir
|-- year
| |-- issue
| | |-- xml_content
|-- year
```

`txt_out_dir` is created with an analogous structure to `xml_in_dir`.
## Process Types

`PROCESS_TYPE` can be one of:
`-p | --process-type` can be one of:

* `single`: Process single publication.
* `serial`: Process publications serially.
* `multi`: Process publications using multiprocessing (default).
* `spark`: Process publications using Spark.
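
For example, to force serial processing instead of the default multiprocessing, the process type can be passed explicitly (a sketch based on the options above):

```console
$ alto2txt -p serial xml_in_dir txt_out_dir
```
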

`DOWNSAMPLE` must be a positive integer, default 1.
### Process Multiple Publications

The following `XSLT` files need to be in an `extract_text.xslts` module:
With the default settings (`multi`, multiprocessing), the following directory structure is assumed for multiple publications in `xml_in_dir`:

* `extract_text_mets18.xslt`: `METS 1.8 XSL` file.
* `extract_text_mets13.xslt`: `METS 1.3 XSL` file.
* `extract_text_bln.xslt`: `BLN XSL` file.
* `extract_text_ukp.xslt`: `UKP XSL` file.

## Process publications
```
xml_in_dir/
├── publication
│ ├── year
│ │ └── issue
│ │ └── xml_content
│ └── year
└── publication
```
Assuming `xml_in_dir` follows this structure, run `alto2txt` with the following in the terminal:

Assume folder `BNA` exists and matches the structure above.
```console
$ alto2txt xml_in_dir txt_out_dir
```

Extract text from every publication:
To downsample and only process every 100th edition:

```bash
alto2txt BNA txt
```console
$ alto2txt xml_in_dir txt_out_dir -d 100
```

Extract text from every 100th issue of every publication:

```bash
alto2txt BNA txt -d 100
```
### Process Single Publication

## Process a single publication
[A demo for processing a single publication is available here.](https://living-with-machines.github.io/alto2txt/#/?id=process-single-publication)

Extract text from every issue of a single publication:
If `-p|--process-type single` is provided then `xml_in_dir` is expected to hold `XML` for a single publication, in the following structure:

```bash
alto2txt -p single BNA/0000151 txt
```
xml_in_dir/
├── year
│ └── issue
│ └── xml_content
└── year
```

Extract text from every 100th issue of a single publication:
Assuming `xml_in_dir` follows this structure, run `alto2txt` in the terminal from the directory containing `xml_in_dir`:

```bash
alto2txt -p single BNA/0000151 txt -d 100
```console
$ alto2txt -p single xml_in_dir txt_out_dir
```

To downsample and only process every 100th edition from the one publication:

```console
$ alto2txt -p single xml_in_dir txt_out_dir -d 100
```

### Plain Text Files Output

`txt_out_dir` is created with an analogous structure to `xml_in_dir`.
One `.txt` file and one metadata `.xml` file are produced per article.


## Configure logging

By default, logs are put in `out.log`.

To specify an alternative location for logs, use the `-l` flag e.g.

```bash
alto2txt -l mylog.txt BNA txt -d 100 2> err.log
```console
$ alto2txt -l mylog.txt -p single xml_in_dir txt_out_dir -d 100 2> err.log
```

## Process publications via Spark

[Information on running on spark.](spark_instructions.md)
[Information on running on spark.](https://living-with-machines.github.io/alto2txt/#/advanced?id=using-spark)
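
Based on the `-p` and `-n` options documented above, a Spark run would look roughly like the following (cluster setup is covered in the linked instructions):

```console
$ alto2txt -p spark xml_in_dir txt_out_dir -n 4
```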

## Contributing

Suggestions, code, tests, further documentation and features – especially to cover various OCR output formats – are needed and welcome. For details and examples see the [Contributing](https://living-with-machines.github.io/alto2txt/#/contributing) section.

## Future work

@@ -191,3 +194,6 @@ This data is "CC0 1.0 Universal Public Domain" - [No Copyright - Other Known Leg
This software has been developed as part of the [Living with Machines](https://livingwithmachines.ac.uk) project.

This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary University of London. Grant reference: AH/S01179X/1

> Last updated 2023-02-21
[^1]: For a more detailed description see: https://www.coloradohistoricnewspapers.org/forum/what-is-metsalto/