GitHub - thetobysiu/witcher-books-processing: Extract the content of Witcher epub books, filter it by rules with Beautiful Soup, and export a txt file suitable for GPT-2 fine-tuning

Intro

Extracts the contents of epub files with the Beautiful Soup parser and filters out content such as quotes and some material unrelated to the Witcher.
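As a rough illustration of the extract-and-filter step, here is a minimal sketch. It assumes each chapter is an XHTML document whose body text sits in <p> elements, and that unwanted content can be recognised by a CSS class. The class names in SKIP_CLASSES and the function name extract_paragraphs are hypothetical; the actual rules are whatever parse.py implements.

```python
from bs4 import BeautifulSoup

# Hypothetical class names marking content to drop (assumption, not taken from parse.py).
SKIP_CLASSES = {"quote", "epigraph"}


def extract_paragraphs(xhtml):
    """Return the text of each <p> element, skipping filtered classes and empty paragraphs."""
    soup = BeautifulSoup(xhtml, "html.parser")
    paragraphs = []
    for p in soup.find_all("p"):
        # bs4 returns the class attribute as a list, or None when it is absent.
        classes = set(p.get("class") or [])
        if classes & SKIP_CLASSES:
            continue
        # Collapse runs of whitespace left over from the markup.
        text = " ".join(p.get_text().split())
        if text:
            paragraphs.append(text)
    return paragraphs


sample = """
<html><body>
<p class="quote">An epigraph that should be dropped.</p>
<p>Geralt rode into the village.</p>
<p>   </p>
<p>The inn was <i>empty</i>.</p>
</body></html>
"""

for line in extract_paragraphs(sample):
    print(line)
```

Filtering on class names is only one possible rule; filtering on tag names or on text patterns would follow the same loop.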

Creates a dictionary structure that separates the text into books, chapters, and sentences.
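One way such a structure could look is a nested dictionary keyed by book, then by chapter, with a list of sentences at the leaves. The sketch below assumes that layout and uses a naive regular-expression sentence splitter; the names split_sentences and build_corpus and the sample text are illustrative only, and the repository may organise its dictionary differently.

```python
import re


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_corpus(books):
    """Turn {book: {chapter: raw text}} into {book: {chapter: [sentences]}}."""
    return {
        book: {chapter: split_sentences(text) for chapter, text in chapters.items()}
        for book, chapters in books.items()
    }


# Made-up sample input, not text from the books.
raw = {
    "Example Book": {
        "Chapter 1": "He drew his sword. The beast growled! Was it too late?",
    }
}

corpus = build_corpus(raw)
print(corpus["Example Book"]["Chapter 1"])
```

A regex splitter like this mishandles abbreviations and quoted dialogue, so a real pipeline would probably need a more careful tokenizer.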

Outputs a txt file with the GPT-2 <|endoftext|> token prepended to each chapter or book.
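The export step can be sketched as follows, assuming a {book: {chapter: [sentences]}} dictionary and one <|endoftext|> token per chapter. The function names render_corpus and write_corpus, and the choice to put the token on its own line, are assumptions made for illustration rather than a description of main.py.

```python
from pathlib import Path

EOT = "<|endoftext|>"  # GPT-2 document separator token


def render_corpus(corpus):
    """Join every chapter into one string, with the token prepended to each chapter."""
    parts = []
    for chapters in corpus.values():
        for sentences in chapters.values():
            parts.append(EOT + "\n" + " ".join(sentences))
    return "\n".join(parts) + "\n"


def write_corpus(corpus, path):
    """Write the rendered corpus to a UTF-8 text file."""
    Path(path).write_text(render_corpus(corpus), encoding="utf-8")


# Made-up sample data.
example = {"Book A": {"Ch 1": ["One.", "Two."], "Ch 2": ["Three."]}}
print(render_corpus(example), end="")
```

Calling write_corpus(example, "witcher.txt") would write the same string to disk. To mark book boundaries instead, the token would be emitted once per outer loop iteration.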

Run main.py to create witcher.txt.

parse.py is the module that parses the books. It reads the epubs inside the books/ folder; the epubs must be uncompressed and original.

A Jupyter notebook, Witcher-epub-explore.ipynb, is also available.

.idea
.gitignore
README.md
Witcher-epub-explore.ipynb
main.py
parse.py
witcher.csv
witcher.txt