Add shell script to verify output directory structure #52

Open · wants to merge 39 commits into base: main

Commits (39)
1b9bb26
Add script to verify output directory structure
thobson88 Nov 3, 2022
68798e1
Add verification directory to gitignore
thobson88 Nov 3, 2022
7456be2
Merge remote-tracking branch 'origin' into feature/verify-output-dirs
spool Nov 29, 2022
4aa4c5f
Set pyspark as optional dependency and add ipython for pytest ipdb su…
spool Nov 29, 2022
4728755
Update extract_text_common.xslt to reflect version 0.3.4
spool Nov 29, 2022
ddb771e
Add tests and -t option to include alto2txt-verify.sh in the command …
spool Nov 30, 2022
a7b5711
Merge branch 'main' into feature/verify-output-dirs
spool Nov 30, 2022
ea8a0d8
Added 'digitised' to the heading
mialondon Dec 1, 2022
155fa81
Fix typos in and rephrase documentation.
spool Dec 6, 2022
6e8f237
Add a footnote reference following #55
griff-rees Dec 6, 2022
54876e0
Merge from main.
spool Feb 16, 2023
00160c4
Merge pre-commit config/dependencies from main.
spool Feb 16, 2023
e16015d
Update poetry.lock
spool Feb 21, 2023
4ef3063
Merge main and manage difference in README.md
spool Feb 21, 2023
73efc6d
Update docs to ease merge with main and manage footnotes.
spool Feb 21, 2023
1fb962d
Merge pull request #55 from Living-with-machines/mialondon-patch-1
griff-rees Feb 21, 2023
7f1158a
Manage diff between README.md and docs/README.md
griff-rees Feb 21, 2023
548cbb8
Merge branch 'main' into doc-copy-edits
griff-rees Feb 21, 2023
264d0a1
docs: add contributing.md and update pre-commit
spool Feb 21, 2023
2cf44b4
docs: add contributing.md and enable in sidebar
spool Feb 21, 2023
29f9917
docs: Merge and fix from branch 'origin/main' into doc-copy-edits
spool Feb 21, 2023
71be695
docs: Fix merge of README.md and add $ to console examples
spool Feb 21, 2023
beb4f0b
docs: fix merge of README.md with main.
spool Feb 21, 2023
18be46e
fix: update extract_text_common.xslt version
spool Feb 21, 2023
ee3e31e
docs: expand contribution sections on pytest and update versions
spool Feb 22, 2023
ec66a8f
fix: black formatting in xml_totext.py and tests.
spool Feb 22, 2023
c92c6ab
Merge pull request #71 from Living-with-machines/doc-copy-edits
griff-rees Feb 22, 2023
d514f5a
docs: flip README.md badges and xml version order
griff-rees Feb 22, 2023
097b061
docs: harmonise shell prompt examples and rm dupe of contrib info in …
spool Feb 22, 2023
0513659
Add codecov to ci-tests.py
griff-rees Feb 22, 2023
2c83905
fix: merge changes added via pre-commit linting/docs
spool Feb 22, 2023
cb56dd7
Merge remote-tracking branch 'origin/main'
spool Feb 22, 2023
155fbab
fix: documentation links in README.md
griff-rees Feb 23, 2023
324d044
fix: poetry.lock coverage 7.1.0 -> 7.2.1
spool Feb 27, 2023
121f2a3
Merge remote-tracking branch 'origin/main'
spool Feb 27, 2023
38431bb
chore: update poetry.lock depedencies
spool Apr 14, 2023
482e5e4
chore: update poetry.lock depedencies and docs
spool Apr 14, 2023
af195d1
feat: add spark
spool Apr 14, 2023
1cc6dec
fix: test section of branch for
spool Apr 14, 2023
3 changes: 2 additions & 1 deletion .github/workflows/ci-tests.yml
@@ -38,5 +38,6 @@ jobs:
poetry run flake8 src --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
- name: Test with pytest
run: |
# pip install pytest
poetry run pytest
- name: Upload coverage reports to Codecov
uses: codecov/codecov-action@v3
3 changes: 3 additions & 0 deletions .gitignore
@@ -105,3 +105,6 @@ dmypy.json

# Ignore demo-output dir
demo-output/

# Ignore dir containing verification artifacts
alto2txt-verify/
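
The verification script itself is not visible in this excerpt of the diff, so the following is only a rough sketch of the idea behind it (the actual `alto2txt-verify.sh` added by this PR may work quite differently): check that the `publication/year/issue` layout of an output directory mirrors the input directory.

```bash
#!/usr/bin/env bash
# Hypothetical sketch only; not the script added in this PR.
# Checks that txt_out_dir mirrors the publication/year/issue layout of xml_in_dir.
set -euo pipefail

xml_in_dir="$1"
txt_out_dir="$2"

# Relative directory paths down to issue level (publication/year/issue).
list_issue_dirs() {
  (cd "$1" && find . -mindepth 3 -maxdepth 3 -type d | sort)
}

if diff <(list_issue_dirs "$xml_in_dir") <(list_issue_dirs "$txt_out_dir"); then
  echo "OK: output directory structure matches input."
else
  echo "ERROR: output directory structure differs from input (see diff above)." >&2
  exit 1
fi
```
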
12 changes: 6 additions & 6 deletions .pre-commit-config.yaml
@@ -2,7 +2,7 @@
# See https://pre-commit.com/hooks.html for more hooks
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.3.0
rev: v4.4.0
hooks:
- id: trailing-whitespace
# Leave demo-files unaltered after download from BL website.
@@ -12,28 +12,28 @@ repos:
- id: check-xml
- id: check-added-large-files
- repo: https://github.com/python-poetry/poetry
rev: '1.3.2'
rev: '1.3.0'
hooks:
- id: poetry-check
- id: poetry-lock
- repo: https://github.com/psf/black
rev: 22.6.0
rev: 23.1.0
hooks:
- id: black
- repo: https://github.com/pre-commit/mirrors-autopep8
rev: v1.6.0 # Use the sha / tag you want to point at
rev: v2.0.1 # Use the sha / tag you want to point at
hooks:
- id: autopep8
- repo: https://github.com/pre-commit/mirrors-mypy
rev: v0.971 # Use the sha / tag you want to point at
rev: v1.0.1 # Use the sha / tag you want to point at
hooks:
- id: mypy
- repo: https://github.com/pre-commit/mirrors-isort
rev: v5.10.1
hooks:
- id: isort
- repo: https://github.com/hadialqattan/pycln
rev: v1.2.5
rev: v2.1.3
hooks:
- id: pycln
args: [--config=pyproject.toml]
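
The hook revision bumps above are the kind of change `pre-commit autoupdate` produces. To pick up the updated hooks locally and re-run them across the repository, the usual workflow is:

```console
$ pre-commit install          # set up the git hook once per clone
$ pre-commit autoupdate       # bump hook revs, as reflected in this diff
$ pre-commit run --all-files  # run every hook against the whole repository
```
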
166 changes: 86 additions & 80 deletions README.md
@@ -1,62 +1,64 @@
# `alto2txt`: Extract plain text from newspapers
# `alto2txt`: Extract plain text from digital newspaper OCR scans

![GitHub](https://img.shields.io/github/license/Living-with-Machines/alto2txt) ![PyPI](https://img.shields.io/pypi/v/alto2txt) [![DOI](https://zenodo.org/badge/259340615.svg)](https://zenodo.org/badge/latestdoi/259340615)
![GitHub](https://img.shields.io/github/license/Living-with-Machines/alto2txt) ![PyPI](https://img.shields.io/pypi/v/alto2txt) [![DOI](https://zenodo.org/badge/259340615.svg)](https://zenodo.org/badge/latestdoi/259340615) [![pre-commit](https://img.shields.io/badge/pre--commit-enabled-brightgreen?logo=pre-commit)](https://github.com/pre-commit/pre-commit)

*Version extract_text 0.3.4*

`alto2txt` converts `XML` `ALTO`/`METS` Optical Character Recognition (OCR) scans into plaintext files with minimal metadata.

Converts `XML` (in `METS` `1.8`/`ALTO` `1.4`, `METS` `1.3`/`ALTO` `1.4`, `BLN` or `UKP` format) publications to plaintext articles and generates minimal metadata.

**`XML` compatibility: `METS 1.8`/`ALTO 1.4`, `METS 1.3`/`ALTO 1.4`, `BLN`, or `UKP` format**

## [Full documentation and demo instructions.](https://living-with-machines.github.io/alto2txt/#/)

`ALTO` and `METS` are industry standards maintained by the [US Library of Congress](https://www.loc.gov/librarians/standards) for newspaper digitization, used by hundreds of modern, large-scale digitization projects. One text file is output per article, each complemented by one `XML` metadata file[^1].

## Installation

### Installation using an Anaconda environment

We recommend installation via Anaconda:
[`METS` (Metadata Encoding and Transmission Standard)](http://www.loc.gov/standards/mets/) is a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, expressed in `XML`. [`ALTO` (Analyzed Layout and Text Objects)](https://www.loc.gov/standards/alto/) is an [`XML schema`](https://en.wikipedia.org/wiki/XML_schema) for technical metadata describing the layout and content of text resources such as book or newspaper pages. `ALTO` is often used in combination with `METS` but can also be used independently. Details of the `ALTO` schema are available at https://github.com/altoxml/schema.

* Refer to the [Anaconda website and follow the instructions](https://docs.anaconda.com/anaconda/install/).

* Create a new environment for `alto2txt`
## Quick Install

```bash
conda create -n py37alto python=3.7
```
### `pip`

* Activate the environment:
As of version `v0.3.4` `alto2txt` is available on [`PyPI`](https://pypi.org/project/alto2txt/) and can be installed via

```bash
conda activate py37alto
```console
$ pip install alto2txt
```

### Installation using pip, outside an Anaconda environment
### `conda`

Note, the use of ``alto2txt`` outside a conda environment has not been as extensively tested as within a conda environment. Whilst we believe that this should work, please use with caution.
If you are comfortable with the command line, git, and already have Python & Anaconda installed, you can install `alto2txt` by navigating to an empty directory in the terminal and running the following commands:

```bash
pip install alto2txt
```console
$ git clone https://github.com/Living-with-machines/alto2txt.git
$ cd alto2txt
$ conda create -n py37alto python=3.7
$ conda activate py37alto
$ pip install .
```

### Installation of a test release

If you need (or want) to install a test release of `alto2txt` you will likely be advised of the specific version number to install. This examaple command will install `v0.3.1-alpha.20`:
If you need (or want) to install a test release of `alto2txt` you will likely be advised of the specific version number to install. This command will install `v0.3.1-alpha.20`:

```bash
pip install -i https://test.pypi.org/simple/ alto2txt==0.3.1a20
$ pip install -i https://test.pypi.org/simple/ alto2txt==0.3.1a20
```

## Usage

Downsampling can be used to convert only every Nth issue of each newspaper. One text file is output per article, each complemented by one `XML` metadata file.

[Click here](https://living-with-machines.github.io/alto2txt/#Demo.md) for more in-depth installation instructions using demo files.

## Usage

> *Note*: the formatting below is altered for readability
```
usage: alto2txt [-h] [-p [PROCESS_TYPE]] [-l [LOG_FILE]] [-d [DOWNSAMPLE]] [-n [NUM_CORES]]
$ alto2txt -h

usage: alto2txt [-h]
[-p [PROCESS_TYPE]]
[-l [LOG_FILE]]
[-d [DOWNSAMPLE]]
[-n [NUM_CORES]]
xml_in_dir txt_out_dir
alto2txt [-h] [-p [PROCESS_TYPE]] [-l [LOG_FILE]] [-d [DOWNSAMPLE]] [-n [NUM_CORES]]
xml_in_dir txt_out_dir

Converts XML publications to plaintext articles

@@ -75,91 +77,92 @@ optional arguments:
-n [NUM_CORES], --num-cores [NUM_CORES]
Number of cores (Spark only). Default 1")
```
To read about downsampling, logs, and using spark see [Advanced Information](https://living-with-machines.github.io/alto2txt/#/advanced).

`xml_in_dir` is expected to hold `XML` for multiple publications, in the following structure:

```
xml_in_dir
|-- publication
| |-- year
| | |-- issue
| | | |-- xml_content
| |-- year
|-- publication
```

However, if `-p|--process-type single` is provided then `xml_in_dir` is expected to hold `XML` for a single publication, in the following structure:

```
xml_in_dir
|-- year
| |-- issue
| | |-- xml_content
|-- year
```

`txt_out_dir` is created with an analogous structure to `xml_in_dir`.
## Process Types

`PROCESS_TYPE` can be one of:
`-p | --process-type` can be one of:

* `single`: Process single publication.
* `serial`: Process publications serially.
* `multi`: Process publications using multiprocessing (default).
* `spark`: Process publications using Spark.
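
For example, to force serial processing instead of the default multiprocessing, the process type can be passed explicitly (a sketch based on the options above):

```console
$ alto2txt -p serial xml_in_dir txt_out_dir
```
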

`DOWNSAMPLE` must be a positive integer, default 1.
### Process Multiple Publications

The following `XSLT` files need to be in an `extract_text.xslts` module:
With the default settings (`multi`, multiprocessing), the following directory structure is assumed for multiple publications in `xml_in_dir`:

* `extract_text_mets18.xslt`: `METS 1.8 XSL` file.
* `extract_text_mets13.xslt`: `METS 1.3 XSL` file.
* `extract_text_bln.xslt`: `BLN XSL` file.
* `extract_text_ukp.xslt`: `UKP XSL` file.

## Process publications
```
xml_in_dir/
├── publication
│ ├── year
│ │ └── issue
│ │ └── xml_content
│ └── year
└── publication
```
Assuming `xml_in_dir` follows this structure, run `alto2txt` with the following in the terminal:

Assume folder `BNA` exists and matches the structure above.
```console
$ alto2txt xml_in_dir txt_out_dir
```

Extract text from every publication:
To downsample and only process every 100th edition:

```bash
alto2txt BNA txt
```console
$ alto2txt xml_in_dir txt_out_dir -d 100
```

Extract text from every 100th issue of every publication:

```bash
alto2txt BNA txt -d 100
```
### Process Single Publication

## Process a single publication
[A demo for processing a single publication is available here.](https://living-with-machines.github.io/alto2txt/#/?id=process-single-publication)

Extract text from every issue of a single publication:
If `-p|--process-type single` is provided then `xml_in_dir` is expected to hold `XML` for a single publication, in the following structure:

```bash
alto2txt -p single BNA/0000151 txt
```
xml_in_dir/
├── year
│ └── issue
│ └── xml_content
└── year
```

Extract text from every 100th issue of a single publication:
Assuming `xml_in_dir` follows this structure, run `alto2txt` in the terminal from the directory containing `xml_in_dir`:

```bash
alto2txt -p single BNA/0000151 txt -d 100
```console
$ alto2txt -p single xml_in_dir txt_out_dir
```

To downsample and only process every 100th edition from the one publication:

```console
$ alto2txt -p single xml_in_dir txt_out_dir -d 100
```

### Plain Text Files Output

`txt_out_dir` is created with an analogous structure to `xml_in_dir`.
One `.txt` file and one metadata `.xml` file are produced per article.


## Configure logging

By default, logs are put in `out.log`.

To specify an alternative location for logs, use the `-l` flag e.g.

```bash
alto2txt -l mylog.txt BNA txt -d 100 2> err.log
```console
$ alto2txt -l mylog.txt -p single xml_in_dir txt_out_dir -d 100 2> err.log
```

## Process publications via Spark

[Information on running on spark.](spark_instructions.md)
[Information on running on spark.](https://living-with-machines.github.io/alto2txt/#/advanced?id=using-spark)
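
Based on the `-p` and `-n` options documented above, a Spark run would look roughly like the following (cluster setup is covered in the linked instructions):

```console
$ alto2txt -p spark xml_in_dir txt_out_dir -n 4
```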

## Contributing

Suggestions, code, tests, further documentation and features – especially to cover various OCR output formats – are needed and welcome. For details and examples see the [Contributing](https://living-with-machines.github.io/alto2txt/#/contributing) section.

## Future work

@@ -191,3 +194,6 @@ This data is "CC0 1.0 Universal Public Domain" - [No Copyright - Other Known Leg
This software has been developed as part of the [Living with Machines](https://livingwithmachines.ac.uk) project.

This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary University of London. Grant reference: AH/S01179X/1

> Last updated 2023-02-21
[^1]: For a more detailed description see: https://www.coloradohistoricnewspapers.org/forum/what-is-metsalto/