NewsQA-es is a Spanish version of the NewsQA Dataset, created by researchers at Grupo PLN, UdelaR.
Due to licensing restrictions, we can't provide a download link. Instead, here are the steps to re-create the dataset by translating NewsQA:
- Download the NewsQA dataset, following the instructions on the NewsQA website.
- Obtain the answer texts with the tools from Maluuba NewsQA.
- Translate every sentence and question. Follow the steps described in the next section.
- Use a translation aligner to find the correspondence between each answer from NewsQA and a span of text from the translated sentence in Spanish. Follow the steps in the repo pln-fing-udelar/Mask-Align.
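
To illustrate the idea behind the last step, here is a minimal sketch of how an answer span can be projected through a word alignment onto the translated sentence. It does not use Mask-Align's actual interface: the `project_span` helper, the `(source_index, target_index)` pair format, and the hand-written alignment are all assumptions made for this example.

```python
from typing import Iterable, List, Optional, Tuple


def project_span(
    alignment: Iterable[Tuple[int, int]],
    src_start: int,
    src_end: int,
) -> Optional[Tuple[int, int]]:
    """Project the inclusive source token span [src_start, src_end] onto the target.

    `alignment` holds (source_index, target_index) pairs (an assumed format).
    Returns the smallest inclusive target span covering every target token
    aligned to a token in the source span, or None if nothing is aligned.
    """
    targets = [t for s, t in alignment if src_start <= s <= src_end]
    if not targets:
        return None
    return min(targets), max(targets)


# Toy example with word reordering; the alignment below is written by hand.
source: List[str] = "He lives near the White House".split()
target: List[str] = "Vive cerca de la Casa Blanca".split()
alignment = [(0, 0), (1, 0), (2, 1), (2, 2), (3, 3), (4, 5), (5, 4)]

# The answer "the White House" covers source tokens 3..5.
span = project_span(alignment, 3, 5)
if span is not None:
    print(" ".join(target[span[0]:span[1] + 1]))  # prints: la Casa Blanca
```

Taking the minimum and maximum aligned target indices keeps the projected answer contiguous even when the translation reorders words, at the cost of occasionally including unaligned tokens that fall inside the span.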
We translated the dataset using the Opus-MT model from Helsinki-NLP. To reproduce it (having already downloaded the NewsQA dataset):
- Clone this repo:

      git clone https://github.com/pln-fing-udelar/newsqa-es
      cd newsqa-es/

- Set up the environment using Conda:

      conda env create
      conda activate newsqa-es

- Place the extracted CNN stories from the NewsQA dataset under `cnn_stories/cnn/stories`:

      mkdir cnn_stories
      tar -xvf cnn_stories.tgz -C cnn_stories/

- Translate the dataset by running the commands below. Translation is slow and benefits from a GPU: for reference, it takes a bit less than a day and a half on a computer with an Nvidia RTX 2080 Ti GPU. Consider changing the `BATCH_SIZE` constant to best fit your hardware: a value that's too high may cause out-of-memory (OOM) errors, while a value that's too low underutilizes your hardware and makes translation slower than necessary.

      mkdir -p cnn_stories/cnn/translated
      ./translate.py

- You will find the translated stories under the folder `cnn_stories/cnn/translated/`.
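
The effect of the `BATCH_SIZE` constant mentioned in the translation step can be sketched as follows. This is not the code in `translate.py`, which may be organized differently: the `batched` and `translate_all` helpers, the `translate_batch` callable, and the value `8` are hypothetical and only show how a batch size bounds how many sentences are processed at once.

```python
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 8  # illustrative value only; not necessarily the default in translate.py


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_all(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = BATCH_SIZE,
) -> List[str]:
    """Translate `sentences` chunk by chunk, preserving the input order.

    `translate_batch` is a stand-in for a model call: it receives one chunk
    and must return one translation per input sentence.
    """
    translated: List[str] = []
    for batch in batched(sentences, batch_size):
        translated.extend(translate_batch(batch))
    return translated
```

A larger batch size means fewer calls to the model but more memory per call, which is why the best value depends on the GPU at hand.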
TODO: how to go from these files to the `newsqa.csv` file required in Mask-Align?
If you encounter issues following these steps, please open a GitHub issue or email us at [email protected].