- We analyze the consistency between Wikipedia category members and their Wikidata counterpart entities, retrieved by executing the SPARQL queries attached to Wikidata categories under the property Wikidata SPARQL query equivalent.
- We focus on comparing the member sets of the two sources and on automatically investigating possible reasons why they are not identical. In addition, we compare the consistency of the information provided within the Wikidata category itself (a minimal sketch of such a set comparison follows this overview).
- To this end, we propose a pipeline for the generation and evaluation of the relevant data (categories and Wikidata/Wikipedia entity sets).
- Wikipedia category members are retrieved by traversing the Wikipedia category hierarchy in all available languages.
- This repository provides the steps needed to either reproduce the current results or conduct a new experiment using other Wikipedia/Wikidata versions.
- The pipeline can work with either the Wikipedia/Wikidata public endpoints or the SQL/JSON dumps.
- Our experiments were run using the Wikidata JSON dump of 2022-05-02 and the Wikipedia SQL dumps of 2022-05-01.
- The figure above depicts the workflow for data generation and evaluation (see paper for more details).
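At its core, the comparison is a plain set comparison between a category's Wikipedia member set and its Wikidata member set. The following minimal sketch is illustrative only and is not the pipeline's actual code; the function name, the choice of Jaccard as an overlap metric, and the example QIDs are ours (the paper reports its own metrics):

```js
// Illustrative sketch: compare two member sets (e.g., sets of Wikidata QIDs)
// and report the overlap, the members missing on each side, and a Jaccard index.
function compareMemberSets(wikipediaMembers, wikidataMembers) {
  const wp = new Set(wikipediaMembers);
  const wd = new Set(wikidataMembers);
  const common = [...wp].filter((id) => wd.has(id));
  const onlyWikipedia = [...wp].filter((id) => !wd.has(id));
  const onlyWikidata = [...wd].filter((id) => !wp.has(id));
  const unionSize = new Set([...wp, ...wd]).size;
  return {
    common,
    onlyWikipedia,
    onlyWikidata,
    jaccard: unionSize === 0 ? 1 : common.length / unionSize,
  };
}

// Example with made-up QIDs:
console.log(compareMemberSets(['Q1', 'Q2', 'Q3'], ['Q2', 'Q3', 'Q4']));
// -> common: ['Q2','Q3'], onlyWikipedia: ['Q1'], onlyWikidata: ['Q4'], jaccard: 0.5
```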
- We provide two options to run the generation workflow:
- First, two kinds of dump versions should be selected (a small download sketch is given below):
  - Wikidata JSON dump (`wikidata-<version>-all.json.gz`)
  - Wikipedia SQL dumps (available dump versions can be checked by visiting, e.g., enwiki for the English Wikipedia)
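As an optional shortcut, the selected dumps can be downloaded with a small script. The sketch below is not part of the repository; it assumes the standard dumps.wikimedia.org directory layout and uses the dump versions from our experiments (note that older dump dates may no longer be hosted on the main dump server, and which Wikipedia SQL tables are required is not listed here).

```js
// Optional helper sketch (not part of the repository): download dump files,
// assuming the usual dumps.wikimedia.org layout.
const fs = require('fs');
const https = require('https');

function download(url, dest) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      if (res.statusCode !== 200) {
        res.resume(); // discard the body
        return reject(new Error(`HTTP ${res.statusCode} for ${url}`));
      }
      res.pipe(fs.createWriteStream(dest))
        .on('finish', resolve)
        .on('error', reject);
    }).on('error', reject);
  });
}

(async () => {
  // Wikidata JSON (entity) dump of 2022-05-02.
  await download(
    'https://dumps.wikimedia.org/wikidatawiki/entities/20220502/wikidata-20220502-all.json.gz',
    'wikidata-20220502-all.json.gz'
  );
  // One of the English Wikipedia SQL dumps of 2022-05-01; repeat for every
  // required table and language.
  await download(
    'https://dumps.wikimedia.org/enwiki/20220501/enwiki-20220501-categorylinks.sql.gz',
    'enwiki-20220501-categorylinks.sql.gz'
  );
})();
```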
- Second, set up a MariaDB database (this is where the Wikipedia SQL dumps will be imported):
  - Install MariaDB: `sudo apt-get install mariadb-server`.
  - Set the root password:
    ```
    $ sudo mysql -u root
    MariaDB [(none)]> SET PASSWORD = PASSWORD('DB_PASSWORD');
    MariaDB [(none)]> update mysql.user set plugin = 'mysql_native_password' where User='root';
    MariaDB [(none)]> FLUSH PRIVILEGES;
    ```
  - Create a database:
    ```
    $ sudo mysql -u root
    MariaDB [(none)]> create database <DB_NAME> character set binary;
    Query OK, 1 row affected (0.00 sec)
    MariaDB [(none)]> use <DB_NAME>;
    Database changed
    ```
  - Optimize the database import by setting the following parameters in `/etc/mysql/my.cnf` and then restarting the database server (`service mysql restart`):
    ```
    wait_timeout = 604800
    innodb_buffer_pool_size = 8G
    innodb_log_buffer_size = 1G
    innodb_log_file_size = 512M
    innodb_flush_log_at_trx_commit = 2
    innodb_doublewrite = 0
    innodb_write_io_threads = 16
    ```
- Update the configuration file in `./src/cache-population/config.js`:
  ```js
  // wikidata dump version
  wdDump: path.join(__dirname, '..', '..', '..', 'wikidata-<version>-all.json.gz'),
  // wikipedia dump version
  dumpDate: <wikipedia-version>,
  // wikipedia database user
  user: 'root',
  // wikipedia database password
  password: <DB_PASSWORD>,
  // wikipedia database name
  databaseName: <DB_NAME>,
  ```
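Before populating the caches, it may be worth verifying that the credentials entered above actually reach the database. Below is a minimal connectivity check, assuming the mysql2 package (an assumption on our part; the repository's own database driver may differ):

```js
// Minimal connectivity check, assuming the mysql2 package (npm install mysql2).
// Use the same credentials as in ./src/cache-population/config.js.
const mysql = require('mysql2/promise');

(async () => {
  const connection = await mysql.createConnection({
    host: 'localhost',
    user: 'root',
    password: 'DB_PASSWORD', // value of `password` in config.js
    database: 'DB_NAME',     // value of `databaseName` in config.js
  });
  // Listing the tables is a quick way to confirm the SQL dumps were imported.
  const [tables] = await connection.query('SHOW TABLES;');
  console.log(`Found ${tables.length} tables in the database.`);
  await connection.end();
})();
```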
- Install Node.js (minimum v14.16.1).
- Download the repository and install dependencies: run `npm install` in the project root folder.
- Populate the caches from the dumps: `npm run runner`.
- Continue with the steps described under Candidate Generation and Cleaning.
- This option does not need a local setup or prior cache population.
- It allows sending requests directly to the Wikidata/Wikipedia public endpoints (a minimal example is sketched after this list).
- To generate the relevant data from the endpoints, the steps under Candidate Generation and Cleaning can be followed directly.
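For illustration, the two public endpoints can be queried as in the sketch below (assuming Node.js 18+ for the global fetch; the example SPARQL query and category are ours, and the repository's actual request logic may differ):

```js
// Illustrative sketch: retrieve members from both public endpoints.
// Assumes Node.js 18+ (global fetch).

// 1) Wikidata Query Service: run a SPARQL query, e.g., one attached to a
//    category via the "Wikidata SPARQL query equivalent" property.
async function wikidataMembers(sparql) {
  const url = 'https://query.wikidata.org/sparql?format=json&query=' +
    encodeURIComponent(sparql);
  const res = await fetch(url, {
    headers: { 'User-Agent': 'wiki-category-consistency-example/0.1' },
  });
  const data = await res.json();
  return data.results.bindings.map((b) => b.item.value); // entity URIs
}

// 2) Wikipedia MediaWiki API: list the members of a category
//    (pagination via cmcontinue is omitted here).
async function wikipediaMembers(categoryTitle) {
  const url = 'https://en.wikipedia.org/w/api.php?action=query&list=categorymembers' +
    '&cmlimit=500&format=json&cmtitle=' + encodeURIComponent(categoryTitle);
  const res = await fetch(url);
  const data = await res.json();
  return data.query.categorymembers.map((m) => m.title);
}

// Example usage (example query and category, not taken from the dataset):
wikidataMembers('SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 10').then(console.log);
wikipediaMembers('Category:Physicists').then(console.log);
```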
- To reproduce the current experiments, one can make use of the already filled caches provided on Zenodo.
- These are a collection of SQLite database files containing all data retrieved from the Wikidata JSON dump of 2022-05-02 and the Wikipedia SQL dumps of 2022-05-01.
- To reproduce the current experiments or perform new ones, follow the steps below (start directly with step 3 if Node.js and the dependencies were already installed in From Dumps):
- Install Node.js (minimum v14.16.1).
- Download the repository and install dependencies: run `npm install` in the project root folder.
- Generate from Endpoints: set `endpointEnabled: true` in `./src/config/config.js`.
- Generate from Dumps: set `endpointEnabled: false` in `./src/config/config.js`.
- Reproduce current experiments: set `endpointEnabled: false` in `./src/config/config.js`. In the root folder, create a folder `./cache/`, unzip `wiki-category-consistency-cache.zip`, and copy the content of its inner folder (`cache`) into the newly created `cache` folder.
- To generate the candidate categories, run `npm run generateCandidate` in the root folder. The output files can be found under `./dataset/`. The output is a JSON file containing all raw data needed for further processing (`./dataset/raw-data.json`), in addition to log and statistics files.
- Next, for candidate cleaning, run `npm run cleanCandidate` in the root folder. The output is a JSON file with the selected category entries (`./dataset/finalData.json`), together with a statistics file.
- The outputs of the candidate generation and cleaning steps using the Wikidata JSON dump of 2022-05-02 and the Wikipedia SQL dumps of 2022-05-01 (`./dataset/raw-data.json`, `./dataset/finalData.json`, and accompanying files) are provided on Zenodo.
- The evaluation takes as input the file with the cleaned category entries, `./dataset/finalData.json`.
- Generate the needed evaluation metrics by running `npm run compareSPARQL` in the root folder. The output is a JSON file, `./eval/compare.json`, together with an additional statistics file.
- Plot the charts by running `npm run generatePlots` in the root folder. The output is a collection of HTML files under the folder `./charts/`.
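For a quick, schema-agnostic look at the evaluation output (the exact field layout of `compare.json` is not documented in this README), something like the following sketch can be used:

```js
// Schema-agnostic sketch: report the top-level shape of ./eval/compare.json,
// its entry count, and a truncated sample entry.
const compare = require('./eval/compare.json');

const entries = Array.isArray(compare) ? compare : Object.entries(compare);
console.log(`Top-level type: ${Array.isArray(compare) ? 'array' : typeof compare}`);
console.log(`Number of entries: ${entries.length}`);
console.log('Sample entry (truncated):',
  JSON.stringify(entries[0], null, 2).slice(0, 500));
```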
- Experiment results (`./eval/compare.json` with a statistics file) using the previously mentioned dump versions are provided on Zenodo.
@inproceedings{DBLP:conf/semweb/Feddoul0S22,
author = {Leila Feddoul and
Frank L{\"{o}}ffler and
Sirko Schindler},
editor = {Lucie{-}Aim{\'{e}}e Kaffee and
Simon Razniewski and
Gabriel Amaral and
Kholoud Saad Alghamdi},
title = {Analysis of Consistency between Wikidata and Wikipedia categories},
booktitle = {Proceedings of the 3rd Wikidata Workshop 2022 co-located with the
21st International Semantic Web Conference (ISWC2022), Virtual Event,
               Hangzhou, China, October 2022},
series = {{CEUR} Workshop Proceedings},
volume = {3262},
publisher = {CEUR-WS.org},
year = {2022},
url = {https://ceur-ws.org/Vol-3262/paper4.pdf}
}
This project is licensed under the MIT License.