WARNING! This is the deprecated version. The new MERA datasets are here and the codebase is here. The old leaderboard is frozen and may be found here; it no longer accepts submissions. All new submissions must follow the instructions on the new leaderboard.

MERA


MERA (Multimodal Evaluation for Russian-language Architectures) is a new open benchmark for evaluating foundation models in the Russian language.

About MERA

The MERA benchmark brings together industry and academic players in one place to study the capabilities of foundation models, draw attention to AI problems, develop collaboration within the Russian Federation and internationally, and create an independent, unified system for measuring all current models. This repository is a customized version of the original Language Model Evaluation Harness (LM-Harness v0.3.0).

Our contributions to this project are:

  • Instruction-based tasks available on 🤗 HuggingFace dataset card.
  • Customized version of LM-Harness evaluation code for models (v0.3.0).
  • Benchmark website with the Leaderboard and the scoring submission system.
  • Baselines of the open models and Human Benchmark.

The MERA benchmark includes 21 text tasks (17 base tasks + 4 diagnostic tasks). See the task-table for a complete list.

| Name | Task Name | Task Type | Test Size | N-shots | Metrics |
|---|---|---|---|---|---|
| MathLogicQA | mathlogicqa | Math, Logic | 1143 | 5 | Acc |
| MultiQ | multiq | Reasoning | 900 | 0 | EM / F1 |
| PARus | parus | Common Sense | 500 | 0 | Acc |
| RCB | rcb | NLI | 438 | 0 | Acc / F1_macro |
| ruModAr | rumodar | Math, Logic | 6000 | 0 | Acc |
| ruMultiAr | rumultiar | Math | 1024 | 5 | Acc |
| ruOpenBookQA | ruopenbookqa | World Knowledge | 400 | 5 | Acc / F1_macro |
| ruTiE | rutie | Reasoning, Dialogue Context, Memory | 430 | 0 | Acc |
| ruWorldTree | ruworldtree | World Knowledge | 525 | 5 | Acc / F1_macro |
| RWSD | rwsd | Reasoning | 260 | 0 | Acc |
| SimpleAr | simplear | Math | 1000 | 5 | Acc |
| BPS | bps | Code, Math | 1000 | 2 | Acc |
| CheGeKa | chegeka | World Knowledge | 416 | 4 | EM / F1 |
| LCS | lcs | Code, Math | 500 | 2 | Acc |
| ruHumanEval | ruhumaneval | Code | 164 | 0 | Pass@k |
| ruMMLU | rummlu | Reasoning | 961 | 5 | Acc |
| USE | use | Exam | 900 | 0 | Grade_norm |
| ruDetox | rudetox | Ethics | 800 | 0 | J(STA, SIM, FL) |
| ruEthics | ruethics | Ethics | 1935 | 0 | 5 MCC |
| ruHateSpeech | ruhatespeech | Ethics | 265 | 0 | Acc |
| ruHHH | ruhhh | Ethics | 178 | 0 | Acc |
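Several rows in the task table report Acc and F1_macro. As a rough illustration of what these two metrics measure, here is a minimal pure-Python sketch. It is not the scoring code used on the website (see the `modules` directory for that); the function names and the example labels are hypothetical and chosen only for the demonstration.

```python
def accuracy(gold, pred):
    """Share of predictions that exactly match the gold labels."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f1_macro(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    scores = []
    for label in sorted(set(gold) | set(pred)):
        pairs = list(zip(gold, pred))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for a three-class task (illustration only).
gold = ["entailment", "neutral", "contradiction", "neutral"]
pred = ["entailment", "neutral", "neutral", "neutral"]

print(accuracy(gold, pred))            # 0.75
print(round(f1_macro(gold, pred), 3))  # 0.6
```

The example shows why both numbers are reported: accuracy stays fairly high at 0.75, while macro F1 drops to 0.6 because the `contradiction` class is never predicted.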

Our aim is to evaluate all the models:

  • in the same scenarios;
  • using the same metrics;
  • with the same adaptation strategy (e.g., prompting);
  • in a way that makes controlled and clear comparisons possible.

MERA is a collaborative project created jointly by industry and academia, with the support of the companies that build foundation models, to ensure fair and transparent leaderboards for model evaluation.

We express our gratitude to our team and partners:

SberDevices, Sber AI, Yandex, Skoltech AI, MTS AI, NRU HSE, Russian Academy of Sciences, etc.

Powered by Alliance AI

Contents

The repository has the following structure:

  • examples — the examples of loading and using data.
  • humanbenchmarks — materials and code for human evaluation.
  • modules — the examples of scoring scripts that are used on the website for scoring your submission.
  • lm-evaluation-harness — a framework for few-shot evaluation of language models.

The submission process is as follows:

  • view the datasets in the HuggingFace preview or run the prepared instruction;
  • clone the MERA benchmark repository;
  • to produce the submission files, use the shell script and the provided customized lm-harness code (the actual model is not required for submission and evaluation);
  • run your model on all the datasets using the lm-harness code: the output is a ZIP archive for submission;
  • register on the website;
  • upload the submission files (ZIP) via the platform interface for automatic assessment.
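Before uploading, it can help to check that the archive covers every task. The sketch below is only an illustration under a stated assumption: that the archive holds one JSON file per task, named after the task (for example `rcb.json`). This README does not specify the layout, so compare against the sample submission for the real format. The helper name `missing_tasks` is hypothetical.

```python
import tempfile
import zipfile
from pathlib import Path, PurePosixPath

# Task names copied from the task table in this README.
TASK_NAMES = [
    "mathlogicqa", "multiq", "parus", "rcb", "rumodar", "rumultiar",
    "ruopenbookqa", "rutie", "ruworldtree", "rwsd", "simplear", "bps",
    "chegeka", "lcs", "ruhumaneval", "rummlu", "use", "rudetox",
    "ruethics", "ruhatespeech", "ruhhh",
]


def missing_tasks(archive_path, task_names=TASK_NAMES):
    """Return task names that have no `<task>.json` entry in the ZIP.

    ASSUMPTION: one JSON file per task, named after the task.
    The real layout is defined by the sample submission, not by this sketch.
    """
    with zipfile.ZipFile(archive_path) as archive:
        stems = {
            PurePosixPath(name).stem
            for name in archive.namelist()
            if name.lower().endswith(".json")
        }
    return [task for task in task_names if task not in stems]


# Demo on a throwaway archive that holds two placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "submission.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("rcb.json", "{}")
        archive.writestr("parus.json", "{}")
    demo_missing = missing_tasks(demo_zip)

print(len(demo_missing), "task files missing")  # 19 task files missing
```

An empty list from `missing_tasks` would mean every task in the table has a matching file under the assumed naming scheme.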

Note that the evaluation result is displayed in the user's account and is kept private. Those who want to make their submission results public can use the "Publish" function. Once the submission passes validation, the model's overall score is shown publicly. The generation parameters, prompts, and few-shot/zero-shot settings are fixed. You can vary them for your own purposes, but if you want to submit your results to the public leaderboard, check that these parameters match the fixed ones and please attach the logs. We have to be sure that the evaluation scenarios are the same for all models and reproducible.
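One of the fixed settings, the number of shots per task, is listed in the task table, so a run can be checked against it mechanically. The following is a minimal sketch: the expected values are copied from the N-shots column of the table, while the `run_shots` dictionary format and the function name `shot_mismatches` are hypothetical and do not come from the lm-harness code.

```python
# Expected shot counts, copied from the N-shots column of the task table.
EXPECTED_SHOTS = {
    "mathlogicqa": 5, "multiq": 0, "parus": 0, "rcb": 0, "rumodar": 0,
    "rumultiar": 5, "ruopenbookqa": 5, "rutie": 0, "ruworldtree": 5,
    "rwsd": 0, "simplear": 5, "bps": 2, "chegeka": 4, "lcs": 2,
    "ruhumaneval": 0, "rummlu": 5, "use": 0, "rudetox": 0,
    "ruethics": 0, "ruhatespeech": 0, "ruhhh": 0,
}


def shot_mismatches(run_shots):
    """Map task -> (expected, actual) for every task whose shot count differs.

    Tasks absent from `run_shots` are reported with an actual value of None.
    `run_shots` is a hypothetical {task_name: n_shots} record of your own run.
    """
    return {
        task: (expected, run_shots.get(task))
        for task, expected in EXPECTED_SHOTS.items()
        if run_shots.get(task) != expected
    }


# Example: a run that matches the table except for one task.
run = dict(EXPECTED_SHOTS)
run["bps"] = 5
print(shot_mismatches(run))  # {'bps': (2, 5)}
```

An empty result means the shot counts agree with the table; prompts and generation parameters still need to be compared separately.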

We provide the sample submission for you to check the format.

The whole MERA evaluation process is shown in the figure:

[Figure: evaluation setup]


📌 This is the first text version of the benchmark. We plan to expand and develop it in the future with new tasks and multimodality.

Feel free to ask any questions about our work by email at [email protected]. If you have ideas or new tasks, please suggest them; it's important! If you find a bug or know how to improve the code, please propose fixes via pull requests and issues in this official GitHub repository 🤗. We are glad to receive feedback in any form.

Cite as

@inproceedings{fenogenova-etal-2024-mera,
    title = "{MERA}: A Comprehensive {LLM} Evaluation in {R}ussian",
    author = "Fenogenova, Alena  and
      Chervyakov, Artem  and
      Martynov, Nikita  and
      Kozlova, Anastasia  and
      Tikhonova, Maria  and
      Akhmetgareeva, Albina  and
      Emelyanov, Anton  and
      Shevelev, Denis  and
      Lebedev, Pavel  and
      Sinev, Leonid  and
      Isaeva, Ulyana  and
      Kolomeytseva, Katerina  and
      Moskovskiy, Daniil  and
      Goncharova, Elizaveta  and
      Savushkin, Nikita  and
      Mikhailova, Polina  and
      Minaeva, Anastasia  and
      Dimitrov, Denis  and
      Panchenko, Alexander  and
      Markov, Sergey",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-long.534",
    doi = "10.18653/v1/2024.acl-long.534",
    pages = "9920--9948",
}