This is the 10th Place Solution for the Kaggle LLM Science Exam Competition
Here is the Original Discussion
- First of all, we want to express our gratitude to all the amazing Kagglers who have contributed to creating a larger collection of public datasets.
- Finally, we ended up with around 360k samples in total, but because of time constraints, we only utilized approximately 303k samples for fine-tuning our models.
- We generated around 150k samples using GPT-3.5 based on a large number of Wikipedia page contents, and combined them with the 99k Dataset, 60k Dataset, 17k MMLU Dataset and 70k Dataset.
- Here: Code To Generate Dataset and Whole Dataset
- We tried `all-MiniLM-L6`, `all-MiniLM-L12`, `gte-small` and `bge-small-en`.
- We tested the different models when we were at a public leaderboard score of 0.843.

  | Model | Before | After |
  | --- | --- | --- |
  | all-MiniLM-L6 | 0.843 | – |
  | all-MiniLM-L12 | 0.843 | 0.837 |
  | gte-small | 0.843 | 0.851 |
  | bge-small-en | 0.843 | 0.822 |

  Finally, we chose `gte-small` as our sentence transformer model (a retrieval sketch follows this item).
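As a rough illustration of how the chosen sentence transformer is used for retrieval, here is a minimal sketch: embed the question plus its answer options and a pool of candidate Wikipedia sentences with `gte-small`, then keep the most similar sentences as context. The Hugging Face checkpoint id, the helper name and the top-k value are our assumptions, not the exact competition code.

```python
# Minimal retrieval sketch (illustrative): rank candidate Wikipedia sentences
# by cosine similarity to the question + options using gte-small embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("thenlper/gte-small")  # assumed checkpoint id for gte-small

def top_context(question: str, options: list[str], candidates: list[str], k: int = 30) -> list[str]:
    """Return the k candidate sentences most similar to the question + options."""
    query = question + " " + " ".join(options)
    q_emb = model.encode(query, normalize_embeddings=True, convert_to_tensor=True)
    c_emb = model.encode(candidates, normalize_embeddings=True, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]               # shape: (len(candidates),)
    top = scores.topk(k=min(k, len(candidates)))
    return [candidates[i] for i in top.indices.tolist()]
```

Comparing the models in the table above then mostly amounts to swapping the checkpoint id.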
- We improved our models' performance by extracting new contexts for each dataset, using the parameters `pages=10` and `sentences=30`.
- To ensure that the search documents contain more relevant knowledge when testing on the private dataset, we extract the contents of Wikipedia pages from HTML and re-cluster them to obtain additional articles, thereby preventing any inconsistencies (a minimal extraction sketch follows this item).
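The HTML-to-text step is only described, not shown; below is a minimal sketch, assuming the pages are parsed with BeautifulSoup and split into fixed-size sentence chunks that can later be embedded and re-clustered. All names and the chunk size are illustrative.

```python
# Illustrative only: pull paragraph text out of a Wikipedia HTML page and cut
# it into sentence chunks that can be embedded and re-clustered later.
from bs4 import BeautifulSoup

def html_to_chunks(html: str, sentences_per_chunk: int = 30) -> list[str]:
    soup = BeautifulSoup(html, "html.parser")
    # Keep paragraph text only; infoboxes, tables and references are dropped.
    text = " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))
    sentences = [s.strip() for s in text.split(". ") if s.strip()]
    return [
        ". ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]
```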
- We combined three types of retrievers to obtain the final outcome:
  1. Extracting context from the entire Wikipedia page.
  2. For the two other retrievers proposed by MB, we expanded the `target_articles` value in the notebook shared by MB. We manually extracted the titles of our dataset, which serves as a useful validation, and added them to `target_articles`. Additionally, we selected more clusters to help us discover more relevant data, growing the index from 270K to approximately 6 million rows. Furthermore, we use both TF-IDF and sentence transformer techniques for context extraction (a TF-IDF sketch follows this item).
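For the TF-IDF side of the context extraction, a minimal sketch with scikit-learn is given below; the n-gram range, the top-k value and the function name are our assumptions.

```python
# Illustrative TF-IDF retriever: score text chunks against the question + options
# and keep the best-matching chunks as additional context.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def tfidf_top_context(query: str, chunks: list[str], k: int = 5) -> list[str]:
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
    chunk_matrix = vectorizer.fit_transform(chunks)   # (n_chunks, vocab)
    query_vec = vectorizer.transform([query])         # (1, vocab)
    scores = cosine_similarity(query_vec, chunk_matrix)[0]
    best = scores.argsort()[::-1][:k]
    return [chunks[i] for i in best]
```

The chunks selected here can then be merged with those selected by the sentence transformer retriever, e.g. by taking their union as the final context.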
- Training Dataset: We divided the entire dataset into multiple chunks to enhance the robustness of our ensemble models. For instance, we split the dataset into four parts and trained four models, each on the larger dataset formed by combining three of the smaller portions. Additionally, we also trained a model on the entire dataset for comparison (a splitting sketch follows this item). Here: 222k-133k-5fold, 222k-148k-3fold, 303k-202k-3fold.
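A minimal sketch of that leave-one-part-out splitting, assuming the data is simply shuffled and split by index (the function name and seed are illustrative):

```python
# Illustrative split: cut the full training set into four parts and build four
# training sets, each formed by dropping one part (train on the remaining three),
# plus one run on the full data for comparison.
import numpy as np

def make_training_splits(n_rows: int, n_parts: int = 4, seed: int = 42) -> list[np.ndarray]:
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    parts = np.array_split(order, n_parts)
    splits = []
    for held_out in range(n_parts):
        train_idx = np.concatenate([p for i, p in enumerate(parts) if i != held_out])
        splits.append(train_idx)
    splits.append(order)  # the "full dataset" run
    return splits
```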
- Validation Dataset: We randomly selected a 2k subset from the entire dataset we generated using GPT-3.5 and combined it with the 200 original training questions to create our validation dataset. In the end, we selected our final submissions according to their performance on the 500 validation dataset (a construction sketch follows this item).
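A minimal sketch of the validation construction, assuming the generated questions and the 200-question official train set live in two CSV files (the file names are assumptions):

```python
# Illustrative validation construction: 2k random GPT-3.5-generated questions
# plus the 200 original training questions.
import pandas as pd

gpt35 = pd.read_csv("gpt35_generated.csv")  # assumed file name for the generated data
train200 = pd.read_csv("train.csv")         # the 200-question official train set

valid = pd.concat(
    [gpt35.sample(n=2000, random_state=0), train200],
    ignore_index=True,
)
valid.to_csv("validation.csv", index=False)
```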
- Parameters: We are incredibly grateful that our teammate has 8 * A100 GPUs. This allowed us to set `max_length` to either 512 or 768 when training different models, whichever was more suitable (an encoding sketch follows this item). Additionally, we trained some models using fp32, which took approximately four days for 222k samples on a single A100 GPU. Here: 222k-133k-models, 222k-148k-models, 303k-202k-models, 222k-768-fp32-model, 303k+50k-finetune-model
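The write-up does not name the backbone; as a rough sketch of where `max_length` enters, the following assumes a Hugging Face multiple-choice setup in which the retrieved context and question are paired with each answer option (the tokenizer checkpoint and helper name are assumptions):

```python
# Illustrative multiple-choice encoding: pair context + question with each of
# the five options and truncate the first segment to MAX_INPUT (512 or 768).
from transformers import AutoTokenizer

MAX_INPUT = 512  # or 768, depending on the model being trained
tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")  # assumed backbone

def encode_example(context: str, question: str, options: list[str]) -> dict:
    firsts = [f"{context} {question}"] * len(options)
    enc = tokenizer(firsts, options, truncation="only_first",
                    max_length=MAX_INPUT, padding="max_length")
    # enc["input_ids"] is a list of len(options) sequences, one per answer option.
    return enc
```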
- Adversarial Weight Perturbation: We incorporated adversarial weight perturbation (AWP) into our training approach to boost the robustness of each individual model, resulting in a further reduction in our training loss (a minimal AWP sketch follows the parameter table below). Here: Training Notebook.
  | Parameter | Value |
  | --- | --- |
  | per_device_train_batch_size | 2 |
  | per_device_eval_batch_size | 1 |
  | learning_rate | 4e-6 |
  | lr_end | 8e-7 |
  | num_train_epochs | 2 or 3 |
  | gradient_accumulation_steps | 16 |
  | weight_decay | 1e-4 |
  | MAX_INPUT | 512 or 768 |
  | dropout_rate | 0 |
  | awp_lr | 0.1 |
  | awp_eps | 1e-4 |
  | awp_start_epoch | 0.5 |

  The above parameters are the best we tested on a single A100 GPU.
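The AWP implementation itself is only linked above; below is a minimal sketch of the usual adversarial-weight-perturbation pattern, wired to the `awp_lr`, `awp_eps` and `awp_start_epoch` values from the table. The class and the clamping details are our own illustration, not the exact training notebook.

```python
# Illustrative AWP (Adversarial Weight Perturbation): after the normal backward
# pass, perturb the weights in the direction of their gradients, run a second
# forward/backward pass on the perturbed weights, then restore the originals.
import torch

class AWP:
    def __init__(self, model, awp_lr=0.1, awp_eps=1e-4, target="weight"):
        self.model, self.awp_lr, self.awp_eps, self.target = model, awp_lr, awp_eps, target
        self.backup = {}

    def attack(self):
        for name, param in self.model.named_parameters():
            if param.requires_grad and param.grad is not None and self.target in name:
                self.backup[name] = param.data.clone()
                norm_grad = torch.norm(param.grad)
                norm_data = torch.norm(param.data)
                if norm_grad != 0 and not torch.isnan(norm_grad):
                    # Step along the gradient, scaled relative to the weight norm,
                    # then clamp the perturbation to an eps-ball around the weights.
                    r = self.awp_lr * param.grad / (norm_grad + 1e-12) * (norm_data + 1e-12)
                    param.data.add_(r)
                    param.data = torch.min(
                        torch.max(param.data, self.backup[name] - self.awp_eps * norm_data),
                        self.backup[name] + self.awp_eps * norm_data,
                    )

    def restore(self):
        for name, param in self.model.named_parameters():
            if name in self.backup:
                param.data = self.backup[name]
        self.backup = {}

# Typical use inside a training step, only once awp_start_epoch (0.5) is reached:
#   loss.backward()
#   if epoch >= awp_start_epoch:
#       awp.attack(); adv_loss = model(**batch).loss; adv_loss.backward(); awp.restore()
#   optimizer.step(); optimizer.zero_grad()
```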
- Ensemble Model: We used Optuna to search for the ensemble weights on the 500 validation dataset (a weight-search sketch follows the table below). We achieved a score of 0.915 on the GPT-3.5 500k dataset and 0.909 on the GPT-4 500k dataset. Because the test dataset was generated using GPT-3.5, we ultimately chose the combination with the higher score under GPT-3.5. Submissions: Submission1, Submission2.
  | Combination | Ratio | Public Leaderboard | Private Leaderboard | Chosen |
  | --- | --- | --- | --- | --- |
  | 222k-full/checkpoint-12300 + 133k fold5/checkpoint-7100 | 2 : 1 | 0.929 | 0.921 | No |
  | 222k-full/checkpoint-12300 + 768_fp32_222k/checkpoint-27800 + 133k fold5/checkpoint-7100 | 2 : 1.5 : 1 | 0.930 | 0.923 | Yes |
  | 222k_finetune/checkpoint-800 + 222k_relevant/checkpoint-12700 | 1 : 1 | 0.930 | 0.924 | No |
  | 148k fold1/checkpoint-5300 + 202k_fold3_512/checkpoint-12500 + 202k_fold2_512/checkpoint-12500 + 768-300k-best-para-epoch2/checkpoint-15600 | 1 : 1 : 1 : 1 | 0.929 | 0.922 | No |
  | 222k-full/checkpoint-12300 + 768_fp32_222k/checkpoint-27800 + 133k fold5/checkpoint-7100 | 2 : 1.5 : 1 | 0.929 | 0.924 | No |
  | 133k fold5/checkpoint-7100 + 303k-50k-finetune/checkpoint-2800 + singlegpu-200k-run-sft-awp-0-1-ls-dropout-4e6/checkpoint-11300 + 768-300k-best-para-epoch2/checkpoint-18900 | 1 : 1 : 1 : 1 | 0.930 | 0.923 | Yes |
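A minimal sketch of the Optuna weight search, assuming each model's per-question probabilities over the five options are already computed on the validation set and MAP@3 is the metric (the helper names and the weight range are assumptions):

```python
# Illustrative Optuna search for ensemble blending weights: maximize MAP@3 of
# the weighted average of per-model probabilities on the validation set.
import numpy as np
import optuna

def map_at_3(probs: np.ndarray, labels: np.ndarray) -> float:
    """probs: (n_samples, 5) blended probabilities, labels: (n_samples,) gold option index."""
    top3 = np.argsort(-probs, axis=1)[:, :3]
    hits = top3 == labels[:, None]
    # MAP@3 credit: 1, 1/2 or 1/3 depending on the rank of the correct option.
    return float((hits / np.array([1.0, 2.0, 3.0])).sum(axis=1).mean())

def search_weights(model_probs: list[np.ndarray], labels: np.ndarray, n_trials: int = 200):
    def objective(trial):
        w = np.array([trial.suggest_float(f"w{i}", 0.0, 2.0) for i in range(len(model_probs))])
        blended = sum(wi * p for wi, p in zip(w, model_probs)) / (w.sum() + 1e-12)
        return map_at_3(blended, labels)

    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=n_trials)
    return study.best_params, study.best_value
```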