- WikimediaDumpExtractor extracts pages from Wikimedia/Wikipedia database backup dumps.
- Download the latest release
Usage: java -jar WikimediaDumpExtractor.jar
pages <input XML file> <output directory> <categories> <search terms> <ids>
categories <input SQL file> <output directory> [minimum category size, default 10000]
The values <categories> and <search terms> can contain multiple entries separated by '|' (see the example below).
Website: https://github.com/EML4U/WikimediaDumpExtractor
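The pages mode is demonstrated in the example that follows. For the categories mode, a minimal sketch of an invocation could look like the line below; the SQL filename is only a placeholder for the category SQL file of the dump you downloaded, and 10000 simply repeats the default minimum category size:

# placeholder filename: substitute the category SQL file from your dump
java -jar WikimediaDumpExtractor.jar categories enwiki-20080103-category.sql ./ 10000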
Download the example XML file. It contains 4 pages extracted from the enwiki 20080103 dump. Then run the following command:
java -jar WikimediaDumpExtractor.jar pages enwiki-20080103-pages-articles-example.xml ./ "Social philosophy" altruism ""
Afterwards, files similar to the example results will be created.
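As noted in the usage above, <categories> and <search terms> accept multiple values separated by '|'. A sketch of such a call, in which the second category and second search term are purely illustrative additions, could be:

# "Ethics" and "egoism" are illustrative values, not part of the original example
java -jar WikimediaDumpExtractor.jar pages enwiki-20080103-pages-articles-example.xml ./ "Social philosophy|Ethics" "altruism|egoism" ""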
To process large XML files (e.g. enwiki 20080103 is 15 GB, enwiki 20210901 is 85 GB), set the following three JVM parameters:
java -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0 -jar WikimediaDumpExtractor.jar ...
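Put together, a run on a full dump might look like the sketch below; the dump filename is only a placeholder for the XML file you actually downloaded and extracted:

# placeholder filename: use your extracted dump file
java -DentityExpansionLimit=0 -DtotalEntitySizeLimit=0 -Djdk.xml.totalEntitySizeLimit=0 -jar WikimediaDumpExtractor.jar pages enwiki-20210901-pages-articles.xml ./ "Social philosophy" altruism ""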
Get Wikimedia dumps here:
- dumps.wikimedia.org
  - Current dumps of the English Wikipedia (now)
  - Archived dumps of Wikipedia (2001 – 2010)
- archive.org
  - Collection wikimediadownloads + enwiki + data dumps (2012 – now)
  - Collection wikipediadumps (2010 – 2011)
- Additional information
- Note: Dump files can be extracted with bzip2 -dk filename.bz2; the -k option keeps the original .bz2 archive.
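For example, a downloaded dump could be decompressed like this before running the extractor (the filename is again only a placeholder):

# placeholder filename: use the .bz2 archive you downloaded
bzip2 -dk enwiki-20210901-pages-articles.xml.bz2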
Data Science Group (DICE) at Paderborn University
This work has been supported by the German Federal Ministry of Education and Research (BMBF) within the project EML4U under grant no. 01IS19080B.