This repository houses the experiments we conducted to evaluate the effectiveness of Submodular Subset Selection for Long-Document Question Answering, as part of the CS 769 course project at IIT Bombay (2023 offering) under the guidance of Prof. Ganesh Ramakrishnan.
- Adithya Bhaskar (190050005)
- Harshit Varma (190100055)
The requirements in `requirements.txt` may be installed with `pip install -r requirements.txt`, with the exception of the `submodlib` library, for whose installation we defer to the instructions in its repository. Configuration parameters for subset selection can be modified in `config.py`, while global variables live in `utils/globals.py`. The submodular optimization functions defined in `utils/submodular.py` call upon the embedders in `models/embedder.py` to embed sentences. In turn, the helper functions in `models/subset_selection.py` use `utils/data.py` to load the data, then call the appropriate functions in `utils/submodular.py` to select and save spans of sentences. The file `main.py` brings together all the pieces of subset selection and allows one to change the objective used. Once the subsets have been selected for each data instance, the scripts in `training/scripts/` may be used to prepare them for model training or inference. Both proceed by calling `training/lrqa/run_lrqa.py` with a configuration file from `training/configs/`. Before calling `run_lrqa.py`, add `training/` to your `PYTHONPATH` if you are not calling it from that directory.
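To make the flow concrete, here is a hypothetical end-to-end sketch of the subset-selection loop described above; the helper names and signatures are illustrative stand-ins, not the repository's actual API.

```python
# Hypothetical sketch of the subset-selection flow; the helpers passed in are
# illustrative stand-ins, not the repository's actual API.
import json

def select_subsets(instances, embed_spans, select_spans, budget=10, out_path="subsets.json"):
    """For each (spans, question) instance: embed the candidate spans and the
    question, run the chosen submodular objective, and save the kept spans."""
    selected = []
    for spans, question in instances:
        span_vecs, query_vec = embed_spans(spans, question)
        indices = select_spans(span_vecs, query_vec, budget=budget)
        selected.append({"question": question, "spans": [spans[i] for i in indices]})
    with open(out_path, "w") as f:
        json.dump(selected, f)
    return selected
```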
Note: As GitHub has restrictions on repository size, we are unable to upload the datasets or the subsets obtained from our methods. Running `main.py` with the appropriate parameters generates the subsets under `data/`, and the scripts in `training/scripts/` (once modified to use the above paths) prepare the training data under `training/data/`.
Long-Document Question Answering involves answering questions based on documents that are too long to fit into the context of standard Transformer-based models. Conventionally, special architectures that admit long inputs have been developed to tackle this task. A popular alternative, however, is to select a subset of sentences or paragraphs (called spans henceforth) from the source document and pass it to a model that is much easier to fine-tune (in our case, DeBERTaV3 [1]). The filtration of spans is typically done by ranking them by similarity to the query under a metric such as TF-IDF or BM25, and then selecting the top-ranked spans until a budget is exhausted.
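As a concrete illustration of such a ranking baseline, here is a minimal sketch (not the exact code in this repository) that ranks spans by TF-IDF cosine similarity to the question using scikit-learn; the toy spans and the budget are placeholders.

```python
# Minimal sketch of a TF-IDF ranking baseline; illustrative only, not the
# exact implementation used in this repository.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def top_k_spans(spans, question, k=10):
    """Rank candidate spans by TF-IDF cosine similarity to the question and
    return the k highest-scoring ones (a simple greedy baseline)."""
    vectorizer = TfidfVectorizer()
    # Fit on spans + question so both are embedded in the same vocabulary.
    matrix = vectorizer.fit_transform(spans + [question])
    span_vecs, query_vec = matrix[:-1], matrix[-1]
    scores = cosine_similarity(span_vecs, query_vec).ravel()
    return [spans[i] for i in scores.argsort()[::-1][:k]]

# Toy usage:
spans = ["The ship left port at dawn.", "The captain was uneasy.", "Rations ran low by day three."]
print(top_k_spans(spans, "When did the ship depart?", k=2))
```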
We use the QuALITY dataset [2] in our experiments. The `train` and `dev` splits of QuALITY are publicly available and have the following characteristics:
Split | No. of Articles | Avg. Article Length (Characters) | No. of Questions |
---|---|---|---|
train | 300 | 24348.86 | 2523 |
dev | 230 | 24306.59 | 2086 |
From the dev split, we randomly select 625 questions to form a validation split and the remaining 1461 to form a test split. Henceforth, at the risk of a mild abuse of terminology, we refer to the former as the `dev` split and to the latter as the `test` split.
Each question of the QuALITY dataset is of the multiple-choice type, with exactly four options. As QuALITY consists largely of text from novels, most questions require knowledge that can only be pieced together from numerous spans, and are rather hard even for time-pressed human annotators.
Given a ground set $V$, a set function $f: 2^V \to \mathbb{R}$ is said to be submodular if it exhibits diminishing returns: for all $A \subseteq B \subseteq V$ and every $v \in V \setminus B$, we have $f(A \cup \{v\}) - f(A) \geq f(B \cup \{v\}) - f(B)$. The function is additionally called monotone if $f(A) \leq f(B)$ whenever $A \subseteq B$.
It turns out that submodular functions possess helpful properties that allow efficient algorithms for their optimization under budget constraints. Readers are referred to sources such as the CS 769 course notes of IIT Bombay or this tutorial for more information. It suffices here to note that greedy methods are provably near-optimal (within a factor of $1 - 1/e$) for maximizing monotone submodular functions under a cardinality constraint.
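To make the greedy procedure concrete, here is a minimal, library-free sketch of cardinality-constrained greedy maximization, using facility location as the example objective; the toy similarity matrix and the budget are placeholders.

```python
# Library-free sketch of greedy maximization of a monotone submodular
# objective (facility location) under a cardinality constraint.
import numpy as np

def facility_location(selected, sim):
    """Facility location value: each ground-set element is 'covered' by its
    most similar selected element."""
    if not selected:
        return 0.0
    return sim[:, selected].max(axis=1).sum()

def greedy_maximize(sim, budget):
    """Pick, at every step, the element with the largest marginal gain."""
    selected = []
    for _ in range(budget):
        candidates = [i for i in range(sim.shape[0]) if i not in selected]
        gains = [facility_location(selected + [i], sim) - facility_location(selected, sim)
                 for i in candidates]
        selected.append(candidates[int(np.argmax(gains))])
    return selected

# Toy usage: cosine similarities of 5 random unit vectors, budget of 2.
emb = np.random.default_rng(0).normal(size=(5, 8))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(greedy_maximize(emb @ emb.T, budget=2))
```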
Of special interest to us is that a number of submodular (and often monotone) objectives can be defined on top of an arbitrary similarity function, capturing either the faithfulness of a selection to a query or the extent to which it represents the ground set of spans. We are particularly interested in objectives based on Submodular Mutual Information.
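For reference, the Submodular Mutual Information between a candidate selection $A$ and a query set $Q$, instantiated with a submodular function $f$, is defined as

$$ I_f(A; Q) = f(A) + f(Q) - f(A \cup Q), $$

which is large when the spans in $A$ and the query jointly cover much of the same information under $f$.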
We use the submodlib [4] library for the implementations of the submodular objectives. More specifically, we evaluate the Mutual Information based versions of the Facility Location, Graph Cut, and Log Determinant objectives, along with the Concave-over-Modular objective. Note that in the case of a single query (which always holds for us), Graph Cut is equivalent to greedily selecting the most similar spans. For the latter two objectives, we also experiment with an initial filtration step that first narrows the ground set to a smaller candidate selection of spans, to which the objective is then applied.
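As a rough illustration of how such a query-focused selection is set up, the sketch below uses submodlib's Facility Location Mutual Information objective; the argument names follow our reading of the submodlib documentation and may differ slightly across versions, and the embedding files are hypothetical placeholders.

```python
# Rough sketch of query-focused selection with submodlib; argument names follow
# our reading of the submodlib docs and may differ slightly across versions.
import numpy as np
from submodlib import FacilityLocationMutualInformationFunction

span_embeddings = np.load("span_embeddings.npy")   # (n_spans, d); hypothetical file
query_embedding = np.load("query_embedding.npy")   # (1, d); hypothetical file

obj = FacilityLocationMutualInformationFunction(
    n=span_embeddings.shape[0],   # size of the ground set of spans
    num_queries=1,                # a single question per instance
    data=span_embeddings,
    queryData=query_embedding,
    metric="cosine",
)
# Greedy maximization under a budget of 10 spans; returns (index, gain) pairs.
selection = obj.maximize(budget=10, optimizer="NaiveGreedy")
selected_indices = [idx for idx, gain in selection]
```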
We use Facebook DPR [5] to generate embeddings for spans prior to optimization, although we also support Sentence Transformers [6]. Our spans are contiguous groups of up to five sentences. In addition, we include TF-IDF and BM25 based greedy baselines for comparison. We also support using the other objectives with TF-IDF and BM25 based embeddings, but we do not compare against them, as we anecdotally found them to perform similarly to the greedy objective. We use sklearn for the former and the rank-bm25 [7] Python library for the latter. Our budget is 10 spans (around 40-50 sentences) per question.
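For completeness, below is a minimal sketch of grouping sentences into spans and embedding them with DPR via Hugging Face `transformers`; the model names and toy sentences are illustrative of the setup described above, not a copy of `models/embedder.py`.

```python
# Minimal sketch of DPR-based span and question embeddings; illustrative only,
# not a copy of models/embedder.py.
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizerFast,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizerFast,
)

ctx_tok = DPRContextEncoderTokenizerFast.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
q_tok = DPRQuestionEncoderTokenizerFast.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base")

def make_spans(sentences, span_size=5):
    """Group consecutive sentences into contiguous spans of up to span_size."""
    return [" ".join(sentences[i:i + span_size]) for i in range(0, len(sentences), span_size)]

@torch.no_grad()
def embed(texts, tokenizer, encoder):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    return encoder(**inputs).pooler_output  # shape: (len(texts), 768)

sentences = ["The ship left port at dawn.", "The captain was uneasy.", "Rations ran low."]
span_vecs = embed(make_spans(sentences), ctx_tok, ctx_enc)
query_vec = embed(["When did the ship depart?"], q_tok, q_enc)
```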
We noticed that the best-performing methods on the QuALITY leaderboard [3] include a pre-training step on the RACE [8] dataset, so we include this step in all of our approaches. We also fine-tune each approach on the `train` split of QuALITY, using selections produced by that same approach. Our training code is a modified version of the LRQA [9] code. The hyperparameters for each approach can be found in the corresponding configuration file inside `training/configs/`.
The accuracies we obtained on our `test` split for the various approaches are as follows:
Approach | Accuracy (%) |
---|---|
Graph Cut | 55.10 |
Facility Location | 50.24 |
Log Determinant | 54.55 |
Filtration + Log Determinant | 52.29 |
Concave-Over-Modular | 55.99 |
Filtration + Concave-Over-Modular | 56.14 |
BM25 | 52.91 |
TF-IDF | 55.44 |
Random | 50.38 |
No context | 44.55 |
We would like to make a few remarks on the above results. First, even without any context, DeBERTaV3 achieves fairly strong performance. Second, adding a randomly chosen context of 10 spans (around 40-50 sentences) helps about as much as going from the random baseline to the best-performing system. Our numbers for the Graph Cut (greedy most-similar) objective roughly match those reported on the leaderboard under the "DeBERTaV3 + DPR" heading. We observe that Concave-Over-Modular, especially with filtration, outperforms Graph Cut and obtains the best results, albeit by a small margin. Most submodular approaches outperform BM25, but TF-IDF surprisingly achieves comparable scores. Finally, Facility Location performs particularly poorly, as it strongly favors diversity over query relevance.
[1] DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing, Pengcheng He, Jianfeng Gao, Weizhu Chen
[2] QuALITY: Question Answering with Long Input Texts, Yes!, Pang et al.
[3] QuALITY Leaderboard, last accessed April 26, 2023.
[4] Submodlib: A Submodular Optimization Library, Kaushal, Ramakrishnan and Iyer
[5] Dense Passage Retrieval for Open-Domain Question Answering, Karpukhin et al.
[6] Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, Reimers and Gurevych
[7] The rank-bm25 Python library, last accessed April 26, 2023.
[8] RACE: Large-scale ReAding Comprehension Dataset From Examinations, Lai et al.