Add noise to your text, inspired by Edunov et al. (2018), "Understanding Back-Translation at Scale"
Made at Qwant Research during my internship
It is often a good idea to add noise to your synthetic text data, for example when using backtranslation
Edunov et al. (2018) showed that doing so can provide a stronger training signal
This repository contains:
- A script to reproduce the noising approach used in the experiments of Edunov et al. (2018)
- A simple architecture so you can play with the noise parameters or implement your own noise functions
Libraries you'll need to run the project:
- tqdm
Clone the repo using
git clone https://github.com/valentinmace/noisy-text.git
I've implemented the 3 noise functions described in the paper:
- Delete words with given probability (default is 0.1)
- Replace words by a filler token with given probability (default is 0.1)
- Swap words up to a certain range (default range is 3)
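As a rough illustration of the three functions above, here is a minimal Python sketch. It is not the code from add_noise.py: the function names (`delete_words`, `replace_words`, `swap_words`, `noisy_sentence`) are hypothetical, whitespace tokenization is assumed, and keeping at least one word after deletion is a choice made for this sketch only. The swap is written as a sort on position plus uniform noise, which is one common way to keep every word within the given range of its original position.

```python
import random


def delete_words(words, probability=0.1, rng=random):
    """Drop each word independently with the given probability."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= probability]
    # Assumption of this sketch: never return an empty sentence.
    return kept if kept else [rng.choice(words)]


def replace_words(words, probability=0.1, filler_token="BLANK", rng=random):
    """Replace each word by the filler token with the given probability."""
    return [filler_token if rng.random() < probability else w for w in words]


def swap_words(words, permutation_range=3, rng=random):
    """Shuffle words locally so that none moves more than permutation_range positions."""
    # Sort positions by "index + uniform noise in [0, range + 1)".
    keys = [i + rng.random() * (permutation_range + 1) for i in range(len(words))]
    order = sorted(range(len(words)), key=keys.__getitem__)
    return [words[i] for i in order]


def noisy_sentence(sentence, rng=random):
    """Apply delete, replace and swap in that order, with the default parameters."""
    words = sentence.split()
    words = delete_words(words, rng=rng)
    words = replace_words(words, rng=rng)
    words = swap_words(words, rng=rng)
    return " ".join(words)


if __name__ == "__main__":
    print(noisy_sentence("the quick brown fox jumps over the lazy dog"))
```

Passing a seeded `random.Random` instance as `rng` makes the noise reproducible, which is handy when comparing parameter values.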
The default parameters reproduce the Edunov et al. (2018) experiments, but you can play with them and may find better values
Example of simple usage
python add_noise.py data/example --progress
Example of complete usage
python add_noise.py data/example --delete_probability 0.9 --replace_probability 0.9 --filler_token 'MASK' --permutation_range 3
Important Note
If you apply a subword tool such as SentencePiece after adding noise to your corpus, note that your replacement token ('BLANK' by default) might be segmented into something like '▁B LAN K'
I recommend making a pass over your corpus to correct it (adapt the command to your token and segmentation):
sed -i 's/▁B LAN K/▁BLANK/g' yourtextfile
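If GNU sed is not available, the same correction can be sketched in Python. The function names `fix_filler_segmentation` and `fix_file` are hypothetical, UTF-8 encoding is assumed, and the segmented form '▁B LAN K' is only the example from above, so adapt both strings to what your subword model actually produces.

```python
def fix_filler_segmentation(line, broken="▁B LAN K", fixed="▁BLANK"):
    """Rejoin a filler token that the subword model split into pieces."""
    return line.replace(broken, fixed)


def fix_file(path, broken="▁B LAN K", fixed="▁BLANK"):
    """Rewrite the file in place, like `sed -i` (loads the whole file into memory)."""
    with open(path, encoding="utf-8") as f:
        lines = f.readlines()
    with open(path, "w", encoding="utf-8") as f:
        f.writelines(fix_filler_segmentation(line, broken, fixed) for line in lines)
```

For very large corpora, writing to a second file line by line avoids holding everything in memory.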
I ran NMT experiments on the WMT 2019 de-en corpus, using all available parallel data and adding the monolingual news-crawl 2018 data via backtranslation.
After translating it from German to English to obtain my synthetic data, I added noise to it using this repo, with the following results. All results are BLEU scores
The first table reports a Transformer model identical to the "base model" in Vaswani et al. (2017), the second table reports a "Transformer Big" model, from the same paper
Model | newstest2017 | newstest2018 |
---|---|---|
baseline | 26.62 | 40.47 |
backtranslation only | 27.06 | 40.06 |
backtranslation + noise | 27.88 | 41.92 |
Transformer base model
Model | newstest2017 | newstest2018 |
---|---|---|
baseline | 29.75 | 45.8 |
backtranslation + noise | 31.33 | 47.4 |
Transformer Big model
Do not hesitate to contact me if you need help, want a feature or find a bug
Contributions are welcome
Valentin Macé – LinkedIn – YouTube – Twitter - [email protected]
Distributed under the MIT license. See LICENSE for more information.