Czech restaurant (and possibly DART) BLEU seems suspiciously low #27
@tuetschek Yes, the BLEU for DART should be >45%, so it might be a good idea to update the v1 paper, as this is a bug. I've been looking at DART, E2E Clean, and WebNLG+ 2020, as I am familiar with them. I get close-to-zero scores when using a general submission. I'm not sure of the exact issue, as I pass in the keys. One thing I noticed is that in the E2E Clean test and validation sets the inputs are duplicated: there are 4693 single-reference examples rather than 1847 examples with multiple references, which affects the metric calculation.
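To make the multi-reference point concrete: if duplicated inputs are scored as 4693 independent single-reference pairs, each hypothesis is compared against only one of its references, which typically deflates BLEU. A minimal sketch of reshaping per-example reference lists into the parallel reference streams that corpus-level BLEU implementations such as sacrebleu expect (the `None` padding for uneven reference counts follows sacrebleu's convention; the helper name is ours):

```python
def to_reference_streams(per_example_refs):
    """Turn per-example reference lists into parallel reference streams
    (one stream per reference index), padding with None where an example
    has fewer references than the maximum."""
    max_refs = max(len(refs) for refs in per_example_refs)
    return [
        [refs[i] if i < len(refs) else None for refs in per_example_refs]
        for i in range(max_refs)
    ]

# Two examples: the first has two references, the second only one.
refs = [["a b c", "a c b"], ["x y"]]
streams = to_reference_streams(refs)
print(streams)  # [['a b c', 'x y'], ['a c b', None]]
```

With this shape, all references for an input are considered jointly per hypothesis, instead of splitting the input into several single-reference rows.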
@jordiclive That's weird... yeah, 4693 sounds like the multiple references are ignored? Can you please send me the submission file you're trying this with?
@tuetschek I think for E2E it's down to the datasets loading script. If you run datasets.load_dataset('GEM/e2e_nlg'), the result is flattened (one reference per example). So we'd have to edit https://huggingface.co/datasets/GEM/e2e_nlg/blob/main/e2e_nlg.py ... I know it is your dataset :) and since I used it for a paper, I'm happy to try to fix it. If you agree, I think it should load as 1847 examples with a varying number of references? It might be easier to store the CSVs that way. For example, I know BigScience is trying to use E2E Clean, but each of the 4693 examples is scored independently against one reference. Thanks! As for the submission script, I would be very interested in why it doesn't work for these datasets. I will send you the general submission script I submitted recently via email, along with the separate reference files; with those, it works as expected (including DART) if you pass them in together with the predictions.
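The un-flattening described above amounts to grouping rows on the input field. A sketch under assumed field names (`input` and `target` are placeholders; the real GEM/e2e_nlg loader uses its own column names, e.g. a meaning-representation field):

```python
from collections import OrderedDict

def group_references(rows):
    """Collapse a flattened split (one reference per row) into one
    example per unique input, each carrying a list of references.
    Preserves first-seen input order."""
    grouped = OrderedDict()
    for row in rows:
        grouped.setdefault(row["input"], []).append(row["target"])
    return [{"input": k, "references": v} for k, v in grouped.items()]

# Toy flattened split: 4 rows, 2 unique inputs (mimicking 4693 -> 1847).
flat = [
    {"input": "name[Alimentum]", "target": "Alimentum is a place."},
    {"input": "name[Alimentum]", "target": "There is a place called Alimentum."},
    {"input": "name[Aromi]", "target": "Aromi is a coffee shop."},
    {"input": "name[Aromi]", "target": "Aromi serves coffee."},
]
multi = group_references(flat)
print(len(multi))  # 2
```

Storing the CSVs in this grouped form (one row per input, references as a list) would also keep downstream consumers like the BigScience evaluation from scoring each reference independently.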
@jordiclive I'm sorry for the non-responsiveness; I got piled under teaching and admin, and now I'm trying to catch up a little. Are you still able to have a look at the HF loader for E2E? It would be great if you could fix it! I'll finally have a look at your email with the submission script.
@jordiclive OK, it looks like the problem is different for each set:
I'm not exactly sure how to solve DART & WebNLG – @sebastianGehrmann, do you know why DART is now not used at all? And what do you propose to do with WebNLG – do we start looking at the …
@tuetschek No worries. Thanks for investigating!! Lol, quite coincidental.
@tuetschek I changed the E2E loading script here to use multiple references. @sebastianGehrmann, can we merge this? I investigated the … I followed this format for E2E. Therefore, I think we should write some GEM code that starts looking at the references key? Otherwise, we would have to change lots of datasets on the HF side and break backward compatibility.
@tuetschek I made a patch fix that seems to solve the issue by looking for the references key and using references when available (#97); hopefully, that will also fix other datasets that use the references key, e.g. asset_turk, as well as DART and WebNLG. For E2E, the problem stems from how the dataset is stored; if the fix here is merged, then the e2e test dataset also needs to be changed in the references repo: with this …
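The patch's fallback logic can be sketched as follows; the field names mirror common GEM schemas, but the exact keys in the metrics code in #97 may differ:

```python
def extract_references(example):
    """Prefer the multi-reference 'references' field when present and
    non-empty; otherwise fall back to the single 'target' field.
    This keeps single-reference datasets working unchanged."""
    refs = example.get("references")
    if refs:
        return list(refs)
    return [example["target"]]

print(extract_references({"target": "t", "references": ["r1", "r2"]}))  # ['r1', 'r2']
print(extract_references({"target": "t", "references": []}))            # ['t']
print(extract_references({"target": "t"}))                              # ['t']
```

Checking for a non-empty list (rather than mere key presence) matters because flattened loaders may emit an empty references column.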
@jordiclive Thanks for fixing everything! Will you get notified when/if the E2E fix gets merged? Can I support the merge somehow 😀?
@tuetschek, I don't think I will; I'm not sure how to create MRs for these 😅. They're two separate HF dataset repos: GEM/e2e_nlg for the E2E fix, and references to change the reference dataset.
@jordiclive I can see a "New Pull Request" button on the "Community" tab for both – is that the place? I'm not sure how you can turn your branch into a PR, but I guess it's possible via the command-line PR workflow?
Thanks! I was struggling to change one file in references without downloading every single dataset. Hopefully we can close this issue when https://huggingface.co/datasets/GEM/references/discussions/1 is resolved.
@jordiclive That's great, thanks! I guess I'll still have to have a look at Czech restaurant, but I hope to do that in the next few days 🙂
OK, so far it looks like CS Restaurants also has problems with some files, so I opened another issue there (when the files are OK, the scores now seem fine).
BLEU of 2-3% is very low – maybe there's a problem with the references and/or the outputs (wrong order etc.)?