Import the whole TSV Wikimedia history dump into MongoDB
The purpose of this repo is to import the whole Italian TSV Wikimedia history dump into a MongoDB database. The reference to the dump is here.
All the data in the TSV files is preserved, but split into three collections based on the event_type: revisions, pages and users. The types are parsed before being inserted into MongoDB, so timestamps become dates, comma-separated lists become arrays of strings, and so on.
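For concreteness, here is a minimal sketch of what such a conversion could look like. It is not taken from main.py: the column names, the timestamp format and the use of MongoDB Extended JSON dates are assumptions made for illustration.

```python
from datetime import datetime, timezone

def convert_value(column, value):
    """Illustrative conversion of one TSV value before import.

    Assumes timestamps look like '2020-05-17 12:34:56' and that
    multi-valued columns are comma-separated; the real dump may differ.
    """
    if value == "":
        return None
    if column.endswith("timestamp"):
        # mongoimport accepts MongoDB Extended JSON, so a dict with a
        # "$date" key becomes a real Date in the collection.
        dt = datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
        dt = dt.replace(tzinfo=timezone.utc)
        return {"$date": dt.strftime("%Y-%m-%dT%H:%M:%SZ")}
    if column.endswith(("groups", "blocks")):
        # Comma-separated list -> array of strings.
        return value.split(",")
    return value
```

Values converted this way can be written out as one JSON document per line and loaded directly with mongoimport, with no further casting needed.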
The repo consists of only two files:
main.py
: It is a Python script that, given a TSV file, creates three JSON files (one per collection) ready to be imported. A rough sketch of this step is shown after this list.
lavora.sh
: It is a bash script that, for each year of the Italian history dump, downloads the compressed file, decompresses it, converts it to JSON through the Python script, imports the JSON files through mongoimport and deletes the files that are no longer needed.
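As a rough illustration of the main.py step (not the actual script: the output file names, the event_entity column and the assumption that the TSV carries a header row are all made up for the example), the splitting could look like this:

```python
import csv
import json
import sys

# Assumed name of the column that says whether a row describes a revision,
# a page or a user event; adjust it to the dump's actual schema.
ENTITY_COLUMN = "event_entity"

def split_tsv(tsv_path):
    """Split one history TSV into three newline-delimited JSON files
    (revisions.json, pages.json, users.json) readable by mongoimport."""
    outputs = {
        "revision": open("revisions.json", "w", encoding="utf-8"),
        "page": open("pages.json", "w", encoding="utf-8"),
        "user": open("users.json", "w", encoding="utf-8"),
    }
    try:
        # DictReader assumes a header row; if the dump has none, pass the
        # documented field names via the fieldnames= argument instead.
        with open(tsv_path, newline="", encoding="utf-8") as tsv:
            for row in csv.DictReader(tsv, delimiter="\t"):
                # Per-value type conversion (dates, arrays, ...) would be
                # applied here, e.g. with a helper like convert_value above.
                target = outputs.get(row.get(ENTITY_COLUMN, ""))
                if target is not None:
                    target.write(json.dumps(row) + "\n")
    finally:
        for f in outputs.values():
            f.close()

if __name__ == "__main__":
    split_tsv(sys.argv[1])
```

Each resulting file can then be loaded with a plain mongoimport --collection ... --file ... call against the database of your choice, which is roughly what lavora.sh automates.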
Just execute ./lavora.sh, after making it executable through chmod +x lavora.sh.
You can choose which range of years to download by modifying the FROM and TO variables in lavora.sh.
The script can take several hours to finish.