A benchmark for Thai sentence representations, built on Thai STS-B, text classification, pair classification, and retrieval datasets.
Sentence representations play a crucial role in downstream NLP tasks such as NLI, text classification, and STS. Recent sentence-representation training techniques require NLI or STS datasets, but no equivalent Thai datasets exist for this purpose. To address this problem, we created the Thai Sentence Vector Benchmark to demonstrate that Thai sentence representations can be trained without any supervised dataset.
Our preliminary results demonstrate that a robust sentence representation model can be trained with an unsupervised technique called SimCSE. We show that SimCSE can be trained on 1.3M sentences from Thai Wikipedia within two hours on Google Colab (V100), and that the resulting SimCSE-XLM-R performs comparably to mDistil-BERT<-mUSE (trained on >1B sentences).
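As a rough illustration of the recipe, unsupervised SimCSE can be reproduced with the sentence-transformers library by pairing every sentence with itself, so that dropout noise produces the two views of each positive pair. This is a minimal sketch, not the repository's exact training script; the corpus path, base model, and hyperparameters are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Base encoder; sentence-transformers adds mean pooling on top.
model = SentenceTransformer("xlm-roberta-base")

# Hypothetical corpus file: one Thai Wikipedia sentence per line.
with open("thai_wikipedia_sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# Unsupervised SimCSE: each sentence is paired with itself; the two
# dropout-noised encodings form a positive pair, and the other
# in-batch sentences act as negatives.
train_examples = [InputExample(texts=[s, s]) for s in sentences]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=128)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
model.save("simcse-model-xlmr")
```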
Moreover, we provide the Thai Sentence Vector Benchmark to evaluate the effectiveness of sentence embedding models on Thai zero-shot and transfer-learning tasks. The benchmark comprises four tasks: semantic textual similarity (STS-B), text classification (transfer), pair classification, and retrieval question answering (QA).
conda create -n thai_sentence_vector_benchmark python==3.11.4
conda activate thai_sentence_vector_benchmark
# Select the appropriate PyTorch version based on your CUDA version
# CUDA 11.8
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia
# CUDA 12.1
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=12.1 -c pytorch -c nvidia
# CPU Only
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 cpuonly -c pytorch
# Install the benchmark package
pip install -e .
# Run the full benchmark; the API keys are only needed when evaluating the
# Cohere/OpenAI embedding models.
python scripts/eval_all.py \
    --cohere_api_key <YOUR_COHERE_API_KEY> \
    --openai_api_key <YOUR_OPENAI_API_KEY>
from sentence_transformers import SentenceTransformer
from thai_sentence_vector_benchmark.benchmark import ThaiSentenceVectorBenchmark

# Load any SentenceTransformer-compatible embedding model.
model = SentenceTransformer("intfloat/e5-mistral-7b-instruct")
benchmark = ThaiSentenceVectorBenchmark()

# Instruction-tuned models such as E5-Mistral expect task-specific prompts.
results = benchmark(
    model,
    task_prompts={
        "sts": "Instruct: Retrieve semantically similar text.\nQuery: ",
        "retrieval": "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: ",
        "pair_classification": "Instruct: Retrieve parallel sentences.\nQuery: ",
        "text_classification": "Instruct: Classify the sentiment of the text.\nText: ",
    },
)
>> {
    "STS": {
        "sts_b": {"Spearman_Correlation": float},
        "Average": {"Spearman_Correlation": float},
    },
    "Text_Classification": {
        "wisesight": {"Accuracy": float, "F1": float},
        "wongnai": {"Accuracy": float, "F1": float},
        "generated_reviews": {"Accuracy": float, "F1": float},
        "Average": {"Accuracy": float, "F1": float},
    },
    "Pair_Classification": {
        "xnli": {"AP": float},
        "Average": {"AP": float},
    },
    "Retrieval": {
        "xquad": {"R@1": float, "MRR@10": float},
        "miracl": {"R@1": float, "MRR@10": float},
        "tydiqa": {"R@1": float, "MRR@10": float},
        "Average": {"R@1": float, "MRR@10": float},
    },
    "Average": float,
}
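The returned dictionary can be indexed directly, for example:

```python
# Keys follow the result schema above.
print(f"Overall average: {results['Average']:.4f}")
print(f"STS-B Spearman: {results['STS']['sts_b']['Spearman_Correlation']:.4f}")
print(f"Retrieval average R@1: {results['Retrieval']['Average']['R@1']:.4f}")
```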
We provide simple and effective sentence embedding methods that do not require labeled data (unsupervised learning):
- We use SimCSE (Simple Contrastive Learning of Sentence Embeddings) on multilingual LMs (mBERT, distil-mBERT, XLM-R) and a monolingual model (WangchanBERTa).
- Training data: Thai Wikipedia.
- Example: SimCSE-Thai.ipynb.
- Training Example on Google Colab: https://colab.research.google.com/github/mrpeerat/Thai-Sentence-Vector-Benchmark/blob/main/SimCSE-Thai.ipynb
- We use the training objective from ConGen on various PLMs.
- Training data: scb-mt-en-th-2020
- Example: ConGen-Thai.ipynb
- We use the training objective from SCT on various PLMs.
- Training data: scb-mt-en-th-2020
- Example: SCT-Thai.ipynb
- Easy to train
- Compatible with every model
- Does not require any annotated dataset
- The best-performing sentence representation method (for now) on STS and downstream tasks (SCT outperformed ConGen and SimCSE in its paper)
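Both ConGen and SCT build on a teacher-student setup. The sketch below shows plain embedding distillation with sentence-transformers' MSELoss as a simplified stand-in; it is not the actual ConGen or SCT objective, and the teacher/student models and file path are illustrative:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Frozen teacher and trainable student; their embedding sizes must match
# (both are 768-dimensional here).
teacher = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
student = SentenceTransformer("airesearch/wangchanberta-base-att-spm-uncased")

# Hypothetical file with one training sentence per line
# (e.g. drawn from scb-mt-en-th-2020).
with open("train_sentences.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

# The student is trained to reproduce the teacher's sentence embeddings.
teacher_embeddings = teacher.encode(sentences)
train_examples = [
    InputExample(texts=[s], label=e) for s, e in zip(sentences, teacher_embeddings)
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
train_loss = losses.MSELoss(model=student)

student.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```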
We also cover other techniques (both supervised and unsupervised) in this repository. Currently, we have evaluated various methods on our benchmark, such as:
- Supervised learning: Sentence-BERT.
- Multilingual sentence representation alignment: CL-ReLKT (NAACL'22)
- We use a translated version of STS-B, created by translating the STS-B data from SentEval with the Google Translate API.
- How to evaluate sentence representation: Easy_Evaluation.ipynb
- How to evaluate sentence representation on Google Colab: https://colab.research.google.com/github/mrpeerat/Thai-Sentence-Vector-Benchmark/blob/main/SentEval.ipynb
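Under the hood, the STS evaluation encodes each sentence pair, scores it with cosine similarity, and reports Spearman's rank correlation against the gold scores (×100). A minimal sketch with placeholder sentence pairs and an illustrative model:

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Placeholder Thai STS-B-style pairs with gold similarity scores in [0, 5].
sentences1 = ["เครื่องบินกำลังบินขึ้น", "ผู้ชายกำลังเล่นกีตาร์", "เด็กกำลังอ่านหนังสือ"]
sentences2 = ["เครื่องบินลำหนึ่งกำลังขึ้นบิน", "ผู้หญิงกำลังหั่นผัก", "เด็กคนหนึ่งอ่านหนังสืออยู่"]
gold_scores = [5.0, 0.5, 4.6]

emb1 = model.encode(sentences1, convert_to_tensor=True)
emb2 = model.encode(sentences2, convert_to_tensor=True)

# Cosine similarity of each aligned pair, then Spearman correlation (*100).
cosine_scores = util.cos_sim(emb1, emb2).diagonal().cpu().tolist()
correlation, _ = spearmanr(cosine_scores, gold_scores)
print(f"Spearman's correlation: {correlation * 100:.2f}")
```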
Base Model | Spearman's Correlation (×100) | Supervised? | Latency (ms)
---|---|---|---
simcse-model-distil-m-bert | 44.27 | | 7.22 ± 0.53
simcse-model-m-bert-thai-cased | 43.95 | | 11.66 ± 0.72
simcse-model-XLMR | 63.98 | | 10.95 ± 0.41
simcse-model-wangchanberta | 60.95 | | 10.54 ± 0.33
simcse-model-phayathaibert | 68.28 | | 11.4 ± 1.01
SCT-model-XLMR | 68.90 | | 10.52 ± 0.46
SCT-model-wangchanberta | 71.35 | | 10.61 ± 0.62
SCT-model-phayathaibert | 74.06 | | 10.64 ± 0.72
SCT-Distil-model-XLMR | 78.78 | | 10.69 ± 0.48
SCT-Distil-model-wangchanberta | 77.77 | | 10.86 ± 0.55
SCT-Distil-model-phayathaibert | 77.89 | | 11.01 ± 0.62
SCT-Distil-model-phayathaibert-bge-m3 | 76.71 | |
ConGen-model-XLMR | 79.69 | | 10.79 ± 0.38
ConGen-model-wangchanberta | 79.20 | | 10.44 ± 0.5
ConGen-model-phayathaibert | 78.90 | | 10.32 ± 0.31
ConGen-BGE_M3-model-phayathaibert | 76.82 | | 10.91 ± 0.43
distiluse-base-multilingual-cased-v2 | 65.37 | ✔️ | 9.38 ± 1.34
paraphrase-multilingual-mpnet-base-v2 | 80.49 | ✔️ | 10.93 ± 0.55
BGE M-3 | 77.22 | ✔️ | 23.5 ± 3.07
Cohere-embed-multilingual-v2.0 | 68.03 | ✔️ |
- We use the Wisesight, Wongnai, and Generated Reviews datasets (results are shown in that order below).
- How to evaluate: Transfer_Evaluation
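In the transfer setting, the encoder is frozen and only a lightweight classifier is fit on the embeddings. A minimal sketch using logistic regression as one common choice (the data here are placeholders; substitute the actual train/test splits):

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Placeholder data; substitute e.g. the Wisesight train/test splits.
train_texts, train_labels = ["บริการดีมาก", "อาหารแย่มาก"], ["pos", "neg"]
test_texts, test_labels = ["ร้านนี้ดีสุด ๆ"], ["pos"]

# Embeddings are frozen features; only the linear classifier is trained.
X_train, X_test = model.encode(train_texts), model.encode(test_texts)
clf = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
preds = clf.predict(X_test)

print(f"Accuracy: {accuracy_score(test_labels, preds) * 100:.2f}")
print(f"Weighted F1: {f1_score(test_labels, preds, average='weighted') * 100:.2f}")
```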
Wisesight:
Base Model | Acc (×100) | F1 (×100, weighted) | Supervised?
---|---|---|---
simcse-model-distil-m-bert | 56.12 | 56.60 | |
simcse-model-m-bert-thai-cased | 55.86 | 56.65 | |
simcse-model-XLMR | 62.07 | 62.76 | |
simcse-model-wangchanberta | 64.17 | 64.39 | |
simcse-model-phayathaibert | 68.59 | 67.73 | |
SCT-model-XLMR | 67.47 | 67.62 | |
SCT-model-wangchanberta | 68.51 | 68.97 | |
SCT-model-phayathaibert | 70.80 | 68.60 | |
SCT-Distil-model-XLMR | 67.73 | 67.75 | |
SCT-Distil-model-wangchanberta | 65.78 | 66.17 | |
SCT-Distil-model-phayathaibert | 66.64 | 66.94 | |
SCT-Distil-model-phayathaibert-bge-m3 | 67.28 | 67.70 | |
ConGen-model-XLMR | 66.75 | 67.41 | |
ConGen-model-wangchanberta | 67.09 | 67.65 | |
ConGen-model-phayathaibert | 67.65 | 68.12 | |
ConGen-BGE_M3-model-phayathaibert | 68.62 | 68.92 | |
distiluse-base-multilingual-cased-v2 | 63.31 | 63.74 | ✔️ |
paraphrase-multilingual-mpnet-base-v2 | 67.05 | 67.67 | ✔️ |
BGE M-3 | 68.36 | 68.92 | ✔️ |
Cohere-embed-multilingual-v2.0 | 66.72 | 67.24 | ✔️ |
Wongnai:
Base Model | Acc (×100) | F1 (×100, weighted) | Supervised?
---|---|---|---
simcse-model-distil-m-bert | 34.31 | 35.81 | |
simcse-model-m-bert-thai-cased | 37.55 | 38.29 | |
simcse-model-XLMR | 40.46 | 38.06 | |
simcse-model-wangchanberta | 40.95 | 37.58 | |
simcse-model-phayathaibert | 37.53 | 38.45 | |
SCT-model-XLMR | 42.88 | 44.75 | |
SCT-model-wangchanberta | 47.90 | 47.23 | |
SCT-model-phayathaibert | 54.73 | 49.48 | |
SCT-Distil-model-XLMR | 46.16 | 47.02 | |
SCT-Distil-model-wangchanberta | 48.61 | 44.89 | |
SCT-Distil-model-phayathaibert | 48.86 | 48.14 | |
SCT-Distil-model-phayathaibert-bge-m3 | 45.95 | 47.29 | |
ConGen-model-XLMR | 44.95 | 46.57 | |
ConGen-model-wangchanberta | 46.72 | 48.04 | |
ConGen-model-phayathaibert | 45.99 | 47.54 | |
ConGen-BGE_M3-model-phayathaibert | 47.98 | 49.22 | |
distiluse-base-multilingual-cased-v2 | 37.76 | 40.07 | ✔️ |
paraphrase-multilingual-mpnet-base-v2 | 45.20 | 46.72 | ✔️ |
BGE M-3 | 51.94 | 52.68 | ✔️ |
Cohere-embed-multilingual-v2.0 | 46.83 | 48.08 | ✔️ |
Generated Reviews:
Base Model | Acc (×100) | F1 (×100, weighted) | Supervised?
---|---|---|---
simcse-model-distil-m-bert | 39.11 | 37.27 | |
simcse-model-m-bert-thai-cased | 38.72 | 37.56 | |
simcse-model-XLMR | 46.27 | 44.22 | |
simcse-model-wangchanberta | 37.37 | 36.72 | |
simcse-model-phayathaibert | 48.76 | 45.14 | |
SCT-model-XLMR | 55.93 | 54.19 | |
SCT-model-wangchanberta | 50.39 | 48.65 | |
SCT-model-phayathaibert | 54.90 | 48.36 | |
SCT-Distil-model-XLMR | 56.76 | 55.50 | |
SCT-Distil-model-wangchanberta | 52.33 | 48.41 | |
SCT-Distil-model-phayathaibert | 54.35 | 52.23 | |
SCT-Distil-model-phayathaibert-bge-m3 | 58.95 | 57.64 | |
ConGen-model-XLMR | 57.93 | 56.66 | |
ConGen-model-wangchanberta | 58.67 | 57.51 | |
ConGen-model-phayathaibert | 58.43 | 57.23 | |
ConGen-BGE_M3-model-phayathaibert | 59.66 | 58.37 | |
distiluse-base-multilingual-cased-v2 | 50.62 | 48.90 | ✔️ |
paraphrase-multilingual-mpnet-base-v2 | 57.48 | 56.35 | ✔️ |
BGE M-3 | 59.53 | 58.35 | ✔️ |
Cohere-embed-multilingual-v2.0 | 57.91 | 56.60 | ✔️ |
- We use the XNLI dev and test sets. We drop the neutral class and map contradiction → 0 and entailment → 1.
- We use the average precision (AP) score as the main metric.
- How to evaluate: XNLI_evaluation.ipynb
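Concretely, each premise-hypothesis pair is scored with cosine similarity, and average precision is computed against the binary labels, so entailed pairs should be ranked above contradictions. A minimal sketch with placeholder pairs and an illustrative model:

```python
from sentence_transformers import SentenceTransformer, util
from sklearn.metrics import average_precision_score

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Placeholder XNLI-style pairs: 1 = entailment, 0 = contradiction.
premises = ["แมวนอนอยู่บนโซฟา", "เขากำลังวิ่งอยู่ในสวน"]
hypotheses = ["มีแมวอยู่บนโซฟา", "เขากำลังนอนหลับ"]
labels = [1, 0]

emb1 = model.encode(premises, convert_to_tensor=True)
emb2 = model.encode(hypotheses, convert_to_tensor=True)
scores = util.cos_sim(emb1, emb2).diagonal().cpu().tolist()

# AP rewards rankings where entailed pairs receive higher similarity.
print(f"AP: {average_precision_score(labels, scores) * 100:.2f}")
```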
Base Model | Dev AP (×100) | Test AP (×100) | Supervised?
---|---|---|---
simcse-model-distil-m-bert | 57.99 | 56.06 | |
simcse-model-m-bert-thai-cased | 58.41 | 58.09 | |
simcse-model-XLMR | 62.05 | 62.05 | |
simcse-model-wangchanberta | 58.13 | 59.01 | |
simcse-model-phayathaibert | 62.10 | 63.34 | |
SCT-model-XLMR | 64.53 | 65.29 | |
SCT-model-wangchanberta | 66.36 | 66.79 | |
SCT-model-phayathaibert | 65.35 | 65.84 | |
SCT-Distil-model-XLMR | 78.40 | 79.14 | |
SCT-Distil-model-wangchanberta | 77.06 | 76.75 | |
SCT-Distil-model-phayathaibert | 77.95 | 77.61 | |
SCT-Distil-model-phayathaibert-bge-m3 | 75.18 | 74.83 | |
ConGen-model-XLMR | 80.68 | 80.98 | |
ConGen-model-wangchanberta | 82.24 | 81.15 | |
ConGen-model-phayathaibert | 80.89 | 80.51 | |
ConGen-BGE_M3-model-phayathaibert | 76.72 | 76.13 | |
distiluse-base-multilingual-cased-v2 | 65.35 | 64.93 | ✔️ |
paraphrase-multilingual-mpnet-base-v2 | 84.14 | 84.06 | ✔️ |
BGE M-3 | 79.09 | 79.02 | ✔️ |
Cohere-embed-multilingual-v2.0 | 60.25 | 61.15 | ✔️ |
- We use the XQuAD, MIRACL, and TyDiQA datasets (results are shown in that order below).
- How to evaluate: Retrieval_Evaluation
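For retrieval, every question is scored against the full passage collection, and we record where the gold passage lands in the ranking: R@1 is the fraction of questions whose gold passage ranks first, and MRR@10 averages the reciprocal rank when the gold passage appears in the top 10. A minimal sketch (placeholder data; question i's gold passage is assumed to be passages[i]):

```python
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")

# Placeholder corpus: the gold passage for questions[i] is passages[i].
questions = ["เมืองหลวงของประเทศไทยคือเมืองอะไร"]
passages = ["กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย", "เชียงใหม่อยู่ทางภาคเหนือของประเทศไทย"]

q_emb = model.encode(questions, convert_to_tensor=True)
p_emb = model.encode(passages, convert_to_tensor=True)
sims = util.cos_sim(q_emb, p_emb)  # shape: (num_questions, num_passages)

r_at_1, mrr_at_10 = 0.0, 0.0
for i in range(len(questions)):
    ranking = torch.argsort(sims[i], descending=True).tolist()
    rank = ranking.index(i) + 1  # 1-based rank of the gold passage
    r_at_1 += 1.0 if rank == 1 else 0.0
    mrr_at_10 += 1.0 / rank if rank <= 10 else 0.0

print(f"R@1: {r_at_1 / len(questions) * 100:.2f}")
print(f"MRR@10: {mrr_at_10 / len(questions) * 100:.2f}")
```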
XQuAD:
Base Model | R@1 (×100) | MRR@10 (×100) | Supervised? | Latency (s)
---|---|---|---|---
simcse-model-distil-m-bert | 18.24 | 27.19 | | 0.61
simcse-model-m-bert-thai-cased | 22.94 | 30.29 | | 1.02
simcse-model-XLMR | 52.02 | 62.94 | | 0.85
simcse-model-wangchanberta | 53.87 | 65.51 | | 0.81
simcse-model-phayathaibert | 73.95 | 81.67 | | 0.79
SCT-model-XLMR | 55.29 | 65.23 | | 1.24
SCT-model-wangchanberta | 66.30 | 76.14 | | 1.23
SCT-model-phayathaibert | 67.56 | 76.14 | | 1.19
SCT-Distil-model-XLMR | 68.91 | 78.19 | | 1.24
SCT-Distil-model-wangchanberta | 62.27 | 72.53 | | 1.35
SCT-Distil-model-phayathaibert | 71.43 | 80.18 | | 1.21
SCT-Distil-model-phayathaibert-bge-m3 | 80.50 | 86.75 | |
ConGen-model-XLMR | 71.76 | 80.01 | | 1.24
ConGen-model-wangchanberta | 70.92 | 79.59 | | 1.21
ConGen-model-phayathaibert | 71.85 | 80.33 | | 1.19
ConGen-BGE_M3-model-phayathaibert | 85.80 | 90.48 | | 1.3
distiluse-base-multilingual-cased-v2 | 49.16 | 58.19 | ✔️ | 1.05
paraphrase-multilingual-mpnet-base-v2 | 71.26 | 79.63 | ✔️ | 1.24
BGE M-3 | 90.50 | 94.33 | ✔️ | 7.22
Cohere-embed-multilingual-v2.0 | 82.52 | 87.78 | ✔️ | XXX
MIRACL:
Base Model | R@1 (×100) | MRR@10 (×100) | Supervised? | Latency (s)
---|---|---|---|---
simcse-model-distil-m-bert | 28.51 | 37.05 | | 4.31
simcse-model-m-bert-thai-cased | 26.19 | 36.11 | | 6.66
simcse-model-XLMR | 34.92 | 47.51 | | 6.17
simcse-model-wangchanberta | 36.29 | 48.96 | | 6.09
simcse-model-phayathaibert | 43.25 | 57.28 | | 6.18
SCT-model-XLMR | 28.51 | 40.84 | | 16.29
SCT-model-wangchanberta | 35.33 | 48.19 | | 16.0
SCT-model-phayathaibert | 37.52 | 51.02 | | 15.8
SCT-Distil-model-XLMR | 40.38 | 51.68 | | 16.17
SCT-Distil-model-wangchanberta | 39.43 | 50.61 | | 16.04
SCT-Distil-model-phayathaibert | 45.16 | 56.52 | | 15.82
SCT-Distil-model-phayathaibert-bge-m3 | 64.80 | 74.46 | |
ConGen-model-XLMR | 43.11 | 55.51 | | 16.4
ConGen-model-wangchanberta | 41.06 | 53.31 | | 15.98
ConGen-model-phayathaibert | 44.34 | 55.77 | | 15.97
ConGen-BGE_M3-model-phayathaibert | 70.40 | 79.33 | | 15.83
distiluse-base-multilingual-cased-v2 | 17.74 | 27.78 | ✔️ | 9.84
paraphrase-multilingual-mpnet-base-v2 | 38.20 | 49.65 | ✔️ | 16.22
BGE M-3 | 79.67 | 86.68 | ✔️ | 91.27
Cohere-embed-multilingual-v2.0 | 66.98 | 77.58 | ✔️ | XXX
TyDiQA:
Base Model | R@1 (×100) | MRR@10 (×100) | Supervised? | Latency (s)
---|---|---|---|---
simcse-model-distil-m-bert | 44.69 | 51.39 | | 1.6
simcse-model-m-bert-thai-cased | 45.09 | 52.37 | | 2.46
simcse-model-XLMR | 58.06 | 64.72 | | 2.35
simcse-model-wangchanberta | 62.65 | 70.02 | | 2.32
simcse-model-phayathaibert | 71.43 | 78.16 | | 2.28
SCT-model-XLMR | 49.28 | 58.62 | | 3.15
SCT-model-wangchanberta | 58.19 | 68.05 | | 3.21
SCT-model-phayathaibert | 63.43 | 71.73 | | 3.21
SCT-Distil-model-XLMR | 56.36 | 65.18 | | 3.3
SCT-Distil-model-wangchanberta | 56.23 | 65.18 | | 3.18
SCT-Distil-model-phayathaibert | 58.32 | 67.42 | | 3.21
SCT-Distil-model-phayathaibert-bge-m3 | 78.37 | 84.01 | |
ConGen-model-XLMR | 60.29 | 68.56 | | 3.28
ConGen-model-wangchanberta | 59.11 | 67.42 | | 3.19
ConGen-model-phayathaibert | 59.24 | 67.69 | | 3.15
ConGen-BGE_M3-model-phayathaibert | 83.36 | 88.29 | | 3.14
distiluse-base-multilingual-cased-v2 | 32.50 | 42.20 | ✔️ | 2.05
paraphrase-multilingual-mpnet-base-v2 | 54.39 | 63.12 | ✔️ | 3.16
BGE M-3 | 89.12 | 93.43 | ✔️ | 20.87
Cohere-embed-multilingual-v2.0 | 85.45 | 90.33 | ✔️ | XXX
Acknowledgments:
- Can: proofreading
- Charin: proofreading and ideas