Czech restaurant (and possibly DART) BLEU seems suspiciously low #27
@tuetschek Yes, the BLEU for DART should be >45%, so it might be a good idea to update the v1 paper, as this is a bug. I've been looking at DART, E2E Clean, and WebNLG+ 2020, as I am familiar with them. I get close-to-zero scores when using a general submission. I'm not sure of the exact issue, as I pass in the keys. One thing I noticed is that in the E2E Clean test and validation sets the inputs are duplicated: there are 4693 single-reference examples rather than 1847 examples with multiple references, which affects the metric calculation.
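To make the multi-reference point concrete: if duplicated inputs are scored as 4693 independent single-reference pairs, each hypothesis is compared against only one of its references, which typically deflates BLEU. A minimal sketch of reshaping per-example reference lists into the parallel reference streams that corpus-level BLEU implementations such as sacrebleu expect (the `None` padding for uneven reference counts follows sacrebleu's convention; the helper name is ours):

```python
def to_reference_streams(per_example_refs):
    """Turn per-example reference lists into parallel reference streams
    (one stream per reference index), padding with None where an example
    has fewer references than the maximum."""
    max_refs = max(len(refs) for refs in per_example_refs)
    return [
        [refs[i] if i < len(refs) else None for refs in per_example_refs]
        for i in range(max_refs)
    ]

# Two examples: the first has two references, the second only one.
refs = [["a b c", "a c b"], ["x y"]]
streams = to_reference_streams(refs)
print(streams)  # [['a b c', 'x y'], ['a c b', None]]
```

With this shape, all references for an input are considered jointly per hypothesis, instead of splitting the input into several single-reference rows.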
@jordiclive That's weird... yeah, 4693 sounds like the multiple references are ignored? Can you please send me the submission file you're trying this with?
@tuetschek I think for E2E it's down to the datasets loading script. If you run datasets.load_dataset('GEM/e2e_nlg'), the result is flattened (one reference per example). So we'd have to edit https://huggingface.co/datasets/GEM/e2e_nlg/blob/main/e2e_nlg.py ... I know it is your dataset :) and since I used it for a paper, I'm happy to try to fix it. If you agree, I think it should load as 1847 examples with a varying number of references? It might be easier to store the CSVs that way. For example, I know BigScience is trying to use E2E Clean, but each of the 4693 examples is scored independently against one reference. Thanks! As for the submission script, I would be very interested in why it doesn't work for these datasets. I will send you the general submission script I submitted recently via email, along with the separate reference files; with those, it works as expected (including DART) if you pass them in together with the predictions.
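The un-flattening described above amounts to grouping rows on the input field. A sketch under assumed field names (`input` and `target` are placeholders; the real GEM/e2e_nlg loader uses its own column names, e.g. a meaning-representation field):

```python
from collections import OrderedDict

def group_references(rows):
    """Collapse a flattened split (one reference per row) into one
    example per unique input, each carrying a list of references.
    Preserves first-seen input order."""
    grouped = OrderedDict()
    for row in rows:
        grouped.setdefault(row["input"], []).append(row["target"])
    return [{"input": k, "references": v} for k, v in grouped.items()]

# Toy flattened split: 4 rows, 2 unique inputs (mimicking 4693 -> 1847).
flat = [
    {"input": "name[Alimentum]", "target": "Alimentum is a place."},
    {"input": "name[Alimentum]", "target": "There is a place called Alimentum."},
    {"input": "name[Aromi]", "target": "Aromi is a coffee shop."},
    {"input": "name[Aromi]", "target": "Aromi serves coffee."},
]
multi = group_references(flat)
print(len(multi))  # 2
```

Storing the CSVs in this grouped form (one row per input, references as a list) would also keep downstream consumers like the BigScience evaluation from scoring each reference independently.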
@jordiclive I'm sorry for the non-responsiveness; I got piled under teaching and admin, and now I'm trying to catch up a little. Are you still able to have a look at the HF loader for E2E? It would be great if you could fix it! I'll finally have a look at your email with the submission script.
@jordiclive OK, it looks like the problem is different for each set:
I'm not exactly sure how to solve DART & WebNLG – @sebastianGehrmann, do you know why DART is now not used at all? And what do you propose to do with WebNLG – do we start looking at the …
@tuetschek No worries. Thanks for investigating!! Lol, quite coincidental.
@tuetschek I changed the E2E loading script here to use multiple references. @sebastianGehrmann, can we merge this? I investigated the … I followed this format for E2E. Therefore, I think we should write some GEM code that starts looking at the references key? Otherwise, we would have to change lots of datasets on the HF side and break backward compatibility.
@tuetschek I made a patch fix that seems to solve the issue by looking for the references key and using references when available (#97); hopefully, that will also fix other datasets that use the references key, e.g. asset_turk, as well as DART and WebNLG. For E2E, the problem stems from how the dataset is stored; if the fix here is merged, then the e2e test dataset also needs to be changed in the references repo: with this …
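The patch's fallback logic can be sketched as follows; the field names mirror common GEM schemas, but the exact keys in the metrics code in #97 may differ:

```python
def extract_references(example):
    """Prefer the multi-reference 'references' field when present and
    non-empty; otherwise fall back to the single 'target' field.
    This keeps single-reference datasets working unchanged."""
    refs = example.get("references")
    if refs:
        return list(refs)
    return [example["target"]]

print(extract_references({"target": "t", "references": ["r1", "r2"]}))  # ['r1', 'r2']
print(extract_references({"target": "t", "references": []}))            # ['t']
print(extract_references({"target": "t"}))                              # ['t']
```

Checking for a non-empty list (rather than mere key presence) matters because flattened loaders may emit an empty references column.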
@jordiclive Thanks for fixing everything! Will you get notified when/if the E2E fix gets merged? Can I support the merge somehow 😀?
@tuetschek, I don't think I will; I'm not sure how to create MRs for these 😅. They're two separate HF dataset repos: GEM/e2e_nlg for the E2E fix, and references to change the reference dataset.
@jordiclive I can see a "New Pull Request" button on the "Community" tab for both – is that the place? I'm not sure how you can turn your branch into a PR, but I guess it's possible via the command-line PR workflow?
Thanks! I was struggling to change one file in references without downloading every single dataset. Hopefully we can close this issue when https://huggingface.co/datasets/GEM/references/discussions/1 is resolved.
@jordiclive That's great, thanks! I guess I'll still have to have a look at Czech restaurant, but I hope to do that in the next few days 🙂
OK, so far it looks like CS Restaurants also has problems with some files, so I opened another issue there (when the files are OK, the scores now seem fine).
BLEU of 2-3% is very low – maybe there's a problem with the references and/or the outputs (wrong order etc.)?