Hidden Biases in Unreliable News Detection Datasets
Official Code for the paper:
Hidden Biases in Unreliable News Detection Datasets
Xiang Zhou, Heba Elfardy, Christos Christodoulopoulos, Thomas Butler and Mohit Bansal
EACL 2021
The code is tested on Python 3.7 and PyTorch 1.6.0. Other dependencies are listed in `requirements.txt` and can be installed by running `pip install -r requirements.txt`.
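A minimal environment setup might look like the following sketch, assuming a fresh Python 3.7 virtual environment and a CPU build of PyTorch; adjust the `torch` install command for your CUDA version.

```bash
# Create and activate a clean Python 3.7 environment
python3.7 -m venv venv
source venv/bin/activate

# Install the tested PyTorch version, then the remaining dependencies
pip install torch==1.6.0
pip install -r requirements.txt
```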
The experiments and results in our paper mainly involve two datasets: NELA and FakeNewsNet.
For the NELA dataset, we use both the 2018 and the 2019 versions. To reproduce the experiments, please first download both versions (on the download page, select all files and choose the original format) and put them under the `data` directory. Then decompress `nela/2018/articles.tar.gz` and `nela/2019/nela-gt-2019-json.tar.bz2` and put the extracted contents under their original directories (a sketch of these commands follows the directory tree below). The structure of `data` should look like this:
data
└── nela
├── 2018
│ ├── articles
│ │ └── ...
│ ├── articles.db.gz
│ ├── articles.tar.gz
│ ├── labels.csv
│ ├── labels.txt
│ ├── nela_gt_2018-new_schema.tar.bz2
│ ├── README.md
│ └── titles.tar.gz
└── 2019
├── labels.csv
├── nela-eng-2019
│ └── ...
├── nela-gt-2019-json.tar.bz2
├── nela-gt-2019.tar.bz2
├── README-1.md
├── README.md
└── source-metadata.json
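A sketch of the decompression step, assuming the archives were downloaded to the paths shown in the tree above and are extracted in place:

```bash
# Extract the 2018 articles next to their archive
cd data/nela/2018
tar -xzf articles.tar.gz

# Extract the 2019 JSON dump next to its archive
cd ../2019
tar -xjf nela-gt-2019-json.tar.bz2
```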
The FakeNewsNet dataset can be crawled using the code from its official GitHub repo. After downloading the dataset, please also put it under `data/fakenewsnet_dataset/raw` (a sketch of this step follows the tree below), and the whole `data` folder should look like this:
data
├── fakenewsnet_dataset
│ └── raw
└── nela
└── ...
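A sketch of the copy step, assuming the official FakeNewsNet crawler wrote its output to a local folder (the source path below is a hypothetical placeholder):

```bash
# Place the crawled FakeNewsNet data under data/fakenewsnet_dataset/raw
mkdir -p data/fakenewsnet_dataset/raw
cp -r /path/to/crawled/fakenewsnet_dataset/* data/fakenewsnet_dataset/raw/
```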
The default location of the `data` directory is under the repository root. If you prefer storing your data in another location, you can change the corresponding variables in `constants.py`.
To create the random/site/time splits of NELA used in the paper, run `python data_helper.py nela {site, time, random}`.
To create the random-label split, run `python data_helper.py nela random_label` (note that you have to manually rename the split dataset after creating the random-label split).
To create the splits of FakeNewsNet used in the paper, run `python data_helper.py fnn {site, time, random}`.
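For example, the following sequence creates all of the splits described above (it assumes the `data` directory is already set up as shown earlier):

```bash
# NELA splits used in the paper
python data_helper.py nela random
python data_helper.py nela site
python data_helper.py nela time

# Random-label split (remember to manually rename the resulting split)
python data_helper.py nela random_label

# FakeNewsNet splits
python data_helper.py fnn random
python data_helper.py fnn site
python data_helper.py fnn time
```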
Example scripts to train the baseline models used in this paper can be found under the `scripts` directory (please refer to Sec. 4.1 in the paper for detailed descriptions of the baseline models). You can change the dataset path in these scripts to train the baselines on different splits.
To train the logistic regression baseline, run `bash scripts/lr.sh`.
To train the title-only RoBERTa models, run `bash scripts/roberta_title.sh`.
To train the title+article RoBERTa models, run `bash scripts/roberta_title_article.sh`.
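For example, to train all three baselines in sequence (this assumes the dataset paths inside the scripts already point at the split you want to use):

```bash
# Train each baseline in turn on the currently configured split
bash scripts/lr.sh
bash scripts/roberta_title.sh
bash scripts/roberta_title_article.sh
```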
- Get the predictions on the validation set (by running the eval commands in the model training scripts).
- To get source-level accuracies, run `python source_evaluation.py --pred_file [PREDICTION_FILE] --key_file [KEY_FILE] --pred_type [PRED_TYPE]`. Set `PRED_TYPE` to `clean` for the logistic regression model and the title-only RoBERTa model, and to `full` for the title+article RoBERTa model, due to their different output formats. Please refer to the Python file for the details of other arguments.
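For example, evaluating a title-only RoBERTa run at the source level might look like the sketch below; all file paths are hypothetical placeholders, so substitute your own prediction and key files.

```bash
# Source-level accuracy for a title-only RoBERTa run (clean output format)
python source_evaluation.py \
  --pred_file output/roberta_title/val_predictions.txt \
  --key_file data/nela/site_split/val.key \
  --pred_type clean
```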
- Train a logistic regression model using `bash scripts/lr.sh` and save the trained model by adding the `save_model [MODEL_PATH]` argument.
- To extract salient features from the logistic regression baselines, run `python analysis_lr.py --model_path [MODEL_PATH]`. Please refer to the Python file for the details of other arguments.
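For example (the model path below is a hypothetical placeholder for wherever `save_model` wrote the trained model):

```bash
# Inspect the most salient features of a saved logistic regression model
python analysis_lr.py --model_path output/lr_nela_site.model
```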
- Create 5 different domain splits using different seeds by running `python data_helper.py nela site [SEED]` (a sketch of this loop is given after this list).
- To get the site-similarity results in Table 7 of the paper, train 5 title+article baselines, one on each of these 5 domain splits, by running `bash scripts/roberta_title_article.sh`, and put all the predictions under the `output` directory. Change the `SAVE_DIRS` and `SITE_PREDS` variables in `site_similarity.py` to match your saved paths and run `python site_similarity.py`.
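A sketch of the split-creation loop referenced above (the seed values are arbitrary examples):

```bash
# Create 5 domain (site) splits of NELA with different random seeds
for SEED in 0 1 2 3 4; do
  python data_helper.py nela site $SEED
done
```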
- Save the titles with correct and wrong predictions in the files `correct.title` and `wrong.title`, respectively, by running `python dump_titles.py --pred_file [PREDICTION_FILE] --key_file [KEY_FILE] --pred_type [PRED_TYPE]`. Set `PRED_TYPE` to `clean` for the logistic regression model and the title-only RoBERTa model, and to `full` for the title+article RoBERTa model, due to their different output formats. Then, put `correct.title` and `wrong.title` in the same directory as `draw_cloud_unigram.py`.
- To draw the word cloud showing the most salient words in examples with correct or wrong predictions (determined by the `PRINT_TYPE` variable in the script), run `python draw_cloud_unigram.py` (an end-to-end sketch follows this list).
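An end-to-end sketch of this step, again with hypothetical prediction and key files:

```bash
# Dump correctly and incorrectly predicted titles (clean format, e.g. title-only RoBERTa)
python dump_titles.py \
  --pred_file output/roberta_title/val_predictions.txt \
  --key_file data/nela/site_split/val.key \
  --pred_type clean

# correct.title / wrong.title must sit next to draw_cloud_unigram.py;
# set PRINT_TYPE inside the script to choose which set to visualize.
python draw_cloud_unigram.py
```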
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.