This repository houses the resources for LOFT, the Long Context Frontiers benchmark, introduced in the research paper "Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?" LOFT consists of 6 long-context task categories spanning retrieval, multi-hop compositional reasoning, and more, totaling 35 datasets across 4 modalities.
$ git clone git@github.com:google-deepmind/loft.git
$ cd loft/
$ pip install -r requirements.txt
The script below downloads all the LOFT datasets under BASE_DIR.
$ BASE_DIR=your-choice-of-directory
$ sh download.sh $BASE_DIR
Each dataset is also available from the links in the Datasets table.
For a small subset of the datasets, download.sh will additionally run preprocess.py, which infills the missing fields in the queries and corpus files.
Once the download completes, you will see the following file structure:
$BASE_DIR
│
└───data
│   └───retrieval
│   │   └───arguana
│   │   │   └───32k
│   │   │   │   └───corpus.jsonl
│   │   │   │   └───dev_queries.jsonl
│   │   │   │   └───few_shot_queries.jsonl
│   │   │   └───128k
│   │   │   └───1m
│   │   └───fever
│   │   └───...
│   │
│   └───rag
│   └───sql
│   └───icl
│
└───prompts
    └───retrieval_128k
    │   retrieval_arguana_128k.txt
    │   retrieval_fever_128k.txt
    └───...
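To check a download, the JSON Lines files in a dataset directory can be loaded and summarized. The sketch below is not part of the repository: the helper names `read_jsonl` and `describe_split` are hypothetical, and it assumes only that each line of a `.jsonl` file is one JSON object. It makes no assumption about which fields the records contain.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def describe_split(dataset_dir):
    """Map each file found to (record count, sorted field names of its first record)."""
    summary = {}
    for name in ("corpus.jsonl", "dev_queries.jsonl", "few_shot_queries.jsonl"):
        path = Path(dataset_dir) / name
        if not path.exists():
            continue  # skip files that are absent for this dataset
        records = read_jsonl(path)
        fields = sorted(records[0]) if records else []
        summary[name] = (len(records), fields)
    return summary
```

For example, `describe_split(Path(base_dir) / "data" / "retrieval" / "arguana" / "32k")` would report how many records each file holds and which fields the first record carries, where `base_dir` is the directory you passed to download.sh.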
The data folder contains the LOFT datasets and the prompts folder contains samples of prompts used in LOFT.
We also provide an example prompt in PROMPT_EXAMPLE.txt showing how Corpus-in-Context (CiC) prompting can be done for the text retrieval task.
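As a rough illustration of the idea, a CiC prompt places the whole corpus in the context, each passage tagged with an identifier, followed by a few solved examples and the query to answer. The sketch below is hypothetical throughout: the function names, the record fields (`id`, `title`, `text`, `query`, `answer_ids`), and the exact template strings are assumptions made for illustration, not the format used by LOFT. PROMPT_EXAMPLE.txt remains the reference for the actual layout.

```python
def format_passage(passage):
    """Render one corpus entry together with its identifier (hypothetical template)."""
    return f"ID: {passage['id']} | TITLE: {passage['title']} | CONTENT: {passage['text']}"


def build_cic_prompt(instruction, corpus, few_shots, query):
    """Concatenate instruction, full corpus, few-shot examples, and the final query."""
    parts = [instruction, "", "== Corpus =="]
    parts.extend(format_passage(p) for p in corpus)
    parts.append("")
    for example in few_shots:
        # Each solved example shows the model which passage ids answer the query.
        parts.append(f"Query: {example['query']}")
        parts.append(f"Answer: {example['answer_ids']}")
        parts.append("")
    # The final query is left open for the model to complete.
    parts.append(f"Query: {query}")
    parts.append("Answer:")
    return "\n".join(parts)
```

Because the corpus sits ahead of the queries, the same corpus prefix can be reused across every query in a split, with only the final query changing.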
We currently support using gemini-1.5-flash-002 from Vertex AI for inference.
You will need a PROJECT_ID from Google Cloud.
To run inference with gemini-1.5-flash-002 and evaluate the predictions:
DATASET=msmarco
PROJECT_ID=your-gcp-project-id
python run_inference.py \
--prompt_prefix_path ${BASE_DIR}/prompts/retrieval_128k/retrieval_${DATASET}_128k.txt \
--data_dir ${BASE_DIR}/data/retrieval/${DATASET}/128k \
--split dev \
--context_length 128k \
--output_path ${BASE_DIR}/outputs/retrieval/${DATASET}/128k/predictions.jsonl \
--project_id ${PROJECT_ID}
python run_evaluation.py \
--answer_file_path ${BASE_DIR}/data/retrieval/${DATASET}/128k/dev_queries.jsonl \
--pred_file_path ${BASE_DIR}/outputs/retrieval/${DATASET}/128k/predictions.jsonl \
--task_type retrieval
The same script can be found in infer_eval.sh.
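To run several datasets in a row, the argument list can be built programmatically. This is a minimal sketch, not part of the repository: `inference_command` is a hypothetical helper, and it assumes the path layout shown in the command above. Only the retrieval prompts at 128k are confirmed by the file structure; other task and context-length combinations are assumptions.

```python
from pathlib import PurePosixPath


def inference_command(base_dir, dataset, context_length, project_id,
                      task="retrieval", split="dev"):
    """Build the run_inference.py argument list using the flags shown above."""
    base = PurePosixPath(base_dir)
    # Assumed naming scheme: prompts/<task>_<length>/<task>_<dataset>_<length>.txt
    prompt = base / "prompts" / f"{task}_{context_length}" / f"{task}_{dataset}_{context_length}.txt"
    data_dir = base / "data" / task / dataset / context_length
    output = base / "outputs" / task / dataset / context_length / "predictions.jsonl"
    return [
        "python", "run_inference.py",
        "--prompt_prefix_path", str(prompt),
        "--data_dir", str(data_dir),
        "--split", split,
        "--context_length", context_length,
        "--output_path", str(output),
        "--project_id", project_id,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root, for example inside a loop over dataset names.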
We provide example queries and predictions files in evaluation/example_predictions/.
Each task_type reports several metric scores.
To find which task_type to use for each dataset, and the primary evaluation metric reported in the paper for that dataset, see the Datasets table.
Task | Dataset | Description | Task Type | Primary Metric | Infilling Needed? | Download |
---|---|---|---|---|---|---|
Text Retrieval | ArguAna | Argument Retrieval | retrieval | recall@1 | - | Link |
Text Retrieval | FEVER | Fact Checking | retrieval | recall@1 | - | Link |
Text Retrieval | FIQA | Question Answering | retrieval | recall@1 | ✅ | Link |
Text Retrieval | MS MARCO | Web Search | retrieval | recall@1 | ✅ | Link |
Text Retrieval | NQ | Question Answering | retrieval | recall@1 | - | Link |
Text Retrieval | Quora | Duplication Detection | retrieval | recall@1 | ✅ | Link |
Text Retrieval | SciFact | Citation Prediction | retrieval | recall@1 | - | Link |
Text Retrieval | Touché-2020 | Argument Retrieval | retrieval | recall@1 | ✅ | Link |
Text Retrieval | TopiOCQA | Multi-turn QA | retrieval | recall@1 | - | Link |
Text Retrieval | HotPotQA | Multi-hop QA | retrieval | mrecall@2 | - | Link |
Text Retrieval | MuSiQue | Multi-hop QA | retrieval | mrecall@5 | - | Link |
Text Retrieval | QAMPARI | Multi-target QA | retrieval | mrecall@5 | - | Link |
Text Retrieval | QUEST | Multi-target QA | retrieval | mrecall@3 | - | Link |
Visual Retrieval | Flickr30k | Image Retrieval | retrieval | recall@1 | ✅ | Coming Soon |
Visual Retrieval | MS COCO | Image Retrieval | retrieval | recall@1 | ✅ | Coming Soon |
Visual Retrieval | OVEN | Image-text Retrieval | retrieval | recall@1 | - | Coming Soon |
Visual Retrieval | MSR-VTT | Video Retrieval | retrieval | recall@1 | ✅ | Link |
Audio Retrieval | FLEURS-en | Audio Retrieval | retrieval | recall@1 | - | Coming Soon |
Audio Retrieval | FLEURS-es | Audio Retrieval | retrieval | recall@1 | - | Coming Soon |
Audio Retrieval | FLEURS-fr | Audio Retrieval | retrieval | recall@1 | - | Coming Soon |
Audio Retrieval | FLEURS-hi | Audio Retrieval | retrieval | recall@1 | - | Coming Soon |
Audio Retrieval | FLEURS-zh | Audio Retrieval | retrieval | recall@1 | - | Coming Soon |
RAG | NQ | Question Answering | rag | subspan_em | - | Link |
RAG | TopiOCQA | Multi-turn QA | rag | subspan_em | - | Coming Soon |
RAG | HotPotQA | Multi-hop QA | rag | subspan_em | - | Link |
RAG | MuSiQue | Multi-hop QA | rag | subspan_em | - | Link |
RAG | QAMPARI | Multi-target QA | multi_value_rag | subspan_em | - | Link |
RAG | QUEST | Multi-target QA | multi_value_rag | subspan_em | - | Link |
SQL | Spider | Single-turn SQL | sql | exec_acc | - | Link |
SQL | SParC | Multi-turn SQL | sql | exec_acc | - | Link |
Many-Shot ICL | BBH-date | Multiple-choice QA | icl | em | - | Link |
Many-Shot ICL | BBH-salient | Multiple-choice QA | icl | em | - | Link |
Many-Shot ICL | BBH-tracking7 | Multiple-choice QA | icl | em | - | Link |
Many-Shot ICL | BBH-web | Multiple-choice QA | icl | em | - | Link |
Many-Shot ICL | LIB-dialogue | Classification | - | - | ✅ | Coming Soon |
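For intuition about the metric names in the Primary Metric column, the sketch below gives simplified readings of recall@k, mrecall@k, and subspan_em. These are assumptions made for illustration, not the repository's evaluation code: the function names are hypothetical, the normalization is deliberately minimal, and run_evaluation.py remains the authority on how scores are actually computed.

```python
import string


def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of gold ids that appear among the top-k predicted ids."""
    gold = set(gold_ids)
    if not gold:
        return 0.0
    return len(set(ranked_ids[:k]) & gold) / len(gold)


def mrecall_at_k(ranked_ids, gold_ids, k):
    """1.0 only if every gold id appears among the top-k predicted ids, else 0.0."""
    gold = set(gold_ids)
    return 1.0 if gold and gold <= set(ranked_ids[:k]) else 0.0


def normalize(text):
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    return " ".join(text.split())


def subspan_em(prediction, gold_answers):
    """1.0 if any non-empty normalized gold answer occurs inside the normalized prediction."""
    pred = normalize(prediction)
    for gold in gold_answers:
        gold_norm = normalize(gold)
        if gold_norm and gold_norm in pred:
            return 1.0
    return 0.0
```

Under this reading, mrecall@k is stricter than recall@k for multi-target datasets such as QAMPARI and QUEST: retrieving only some of the gold passages scores 0.0 rather than a partial value.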
- Remaining multi-modal data.
- Prompts for RAG, SQL, and multi-modal retrieval.
- Prompt conversion code (data => prompt).
- Inference code and prompts for retrieval (10/25/24).
- Evaluation code for ICL, along with some ICL and visual retrieval datasets (8/30/24).
- Evaluation code for text tasks and code to regenerate some of the LOFT datasets (6/29/24).
- Initial release with links to download many of the LOFT text datasets (6/20/24).
@article{Lee2024LongContext,
title={Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?},
author={Jinhyuk Lee and Anthony Chen and Zhuyun Dai and Dheeru Dua and Devendra Singh Sachan and Michael Boratko and Yi Luan and Sébastien M. R. Arnold and Vincent Perot and Siddharth Dalmia and Hexiang Hu and Xudong Lin and Panupong Pasupat and Aida Amini and Jeremy R. Cole and Sebastian Riedel and Iftekhar Naim and Ming-Wei Chang and Kelvin Guu},
journal={ArXiv},
year={2024},
volume={abs/2406.13121},
url={https://arxiv.org/abs/2406.13121}
}
Copyright 2024 DeepMind Technologies Limited
All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode
Individual tasks may be subject to copyright and licensing from their respective owners - please see individual download files for details.
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.