Code to rebuild the NewsQA-es dataset: a Spanish version of the NewsQA dataset

pln-fing-udelar/newsqa-es

NewsQA-es

NewsQA-es is a Spanish version of the NewsQA Dataset, created by researchers at Grupo PLN, UdelaR.

Obtaining the dataset

Due to licensing restrictions, we can't provide a download link. Instead, we provide the steps to re-create the dataset by translating NewsQA:

  1. Download the NewsQA dataset. Follow the steps on the NewsQA website to download the dataset.
  2. Extract the answer texts with the tools from Maluuba NewsQA.
  3. Translate every sentence and question. Follow the steps described in the next section.
  4. Use a translation aligner to find, for each NewsQA answer, the corresponding span of text in the translated Spanish sentence. Follow the steps in the repo pln-fing-udelar/Mask-Align.
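
To illustrate the idea behind step 4, here is a minimal sketch of projecting an answer span through word alignments. It is not the Mask-Align implementation or its data format: the function name `project_span`, the `(source_index, target_index)` pair representation, and the toy alignment are all assumptions made for this example.

```python
from typing import Iterable, Optional, Tuple


def project_span(
    src_start: int,
    src_end: int,
    alignment: Iterable[Tuple[int, int]],
) -> Optional[Tuple[int, int]]:
    """Project an inclusive source token span onto the target sentence.

    `alignment` holds (source_index, target_index) token pairs (a
    hypothetical format, chosen for illustration). Returns the smallest
    inclusive target span covering every token aligned to the source
    span, or None if nothing in the span is aligned.
    """
    targets = [t for s, t in alignment if src_start <= s <= src_end]
    if not targets:
        return None
    return min(targets), max(targets)


# Toy example with a hand-written alignment (not real aligner output).
english = "the red car stopped".split()
spanish = "el coche rojo se detuvo".split()
alignment = [(0, 0), (1, 2), (2, 1), (3, 3), (3, 4)]

span = project_span(1, 2, alignment)  # the answer "red car"
if span is not None:
    start, end = span
    print(" ".join(spanish[start:end + 1]))  # prints "coche rojo"
```

Taking the minimum and maximum aligned positions keeps the projected answer contiguous even when the word order changes between the two languages, as with "red car" and "coche rojo" here.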

Translating NewsQA into Spanish

We translated the dataset using the Opus-MT model from Helsinki-NLP. To reproduce the translation, once you have downloaded the NewsQA dataset:

  1. Clone this repo:

    git clone https://github.com/pln-fing-udelar/newsqa-es
    cd newsqa-es/
  2. Set up the environment using Conda:

    conda env create
    conda activate newsqa-es
  3. Place the extracted CNN stories from the NewsQA dataset under cnn_stories/cnn/stories:

    mkdir cnn_stories
    tar -xvf cnn_stories.tgz -C cnn_stories/
  4. Run the following commands to translate the dataset. Translation is slow and benefits from a GPU: for reference, it takes a bit less than a day and a half on a computer with an Nvidia RTX 2080 Ti GPU. Consider changing the BATCH_SIZE constant to best fit your hardware: a value that's too high may cause out-of-memory (OOM) errors, while a value that's too low underutilizes your resources and makes translation slower than it could be.

    mkdir -p cnn_stories/cnn/translated
    ./translate.py
  5. You will find the translated stories under the folder cnn_stories/cnn/translated/.
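
To show why the batch size matters, here is a rough sketch of batched translation with an Opus-MT checkpoint through the Hugging Face `transformers` library. It is not the code in translate.py, which may work differently. The helper names, the `BATCH_SIZE` value, and the `Helsinki-NLP/opus-mt-en-es` checkpoint are assumptions made for this example.

```python
from typing import Callable, Iterator, List, Sequence

BATCH_SIZE = 8  # hypothetical value; tune it to your GPU memory


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_all(
    sentences: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = BATCH_SIZE,
) -> List[str]:
    """Translate the sentences batch by batch, preserving their order."""
    translated: List[str] = []
    for batch in batched(sentences, batch_size):
        translated.extend(translate_batch(batch))
    return translated


def make_opus_mt_translator(
    model_name: str = "Helsinki-NLP/opus-mt-en-es",  # assumed checkpoint
) -> Callable[[List[str]], List[str]]:
    """Build a batch translator (needs `transformers` and `torch`)."""
    import torch
    from transformers import MarianMTModel, MarianTokenizer

    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name).to(device)

    def translate_batch(batch: List[str]) -> List[str]:
        # Pad to the longest sentence in the batch, so larger batches
        # use more memory per forward pass.
        inputs = tokenizer(
            batch, return_tensors="pt", padding=True, truncation=True
        ).to(device)
        outputs = model.generate(**inputs)
        return tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return translate_batch
```

With the libraries installed and the model downloaded, something like `translate_all(sentences, make_opus_mt_translator())` would translate a list of English sentences. Each batch is padded to its longest sentence, which is why a large batch size can exhaust GPU memory while a small one leaves the GPU idle.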

TODO: how to go from these files to the newsqa.csv file required in Mask-Align?

Contact Us

If you encounter issues following these steps, please open a GitHub issue or email us at [email protected].